Training_Bayes_Classifier(Example_Set)

1. collect all attributes in Example_Set

   A <- all distinct attributes in Example_Set
   C <- all distinct values in Example_Set

2. calculate the value probability P(c_j) and conditional probability P(w_i | c_j)

   For each target class value c_j in C do

      docs_j <- subset of Example_Set for which the target class value is c_j

      P(c_j) <- |docs_j| / |Examples|

      Text_j <- a single document created by concatenating all attributes of docs_j

      n <- total number of attributes in Text_j
      for each attribute w_i in A

         n_k <- number of times attribute w_i occurs in Text_j

         P(w_i | c_j) <- (n_k + 1) / (n + |A|)

Classify_Text(input_text, Training_Set)

1. Access Training_Set to get the attribute set and the value set

   A <- all distinct attributes in Training_Set
   C <- all distinct values in Training_Set

2. count how many times the attributes defined in A occur in input_text

   positions <- all word positions in input_text that contain tokens found in A

3. pick one value as the final answer

   For each c_j in C

      (1) Access Training_Set to get the value probability P(c_j) and the
          attribute and value conditional probability P(w_i | c_j)

      (2) calculate the answer probability P(c_j | w_1, w_2, ..., w_n)

   c* = arg max P(c_j | w_1, w_2, ..., w_n) = arg max P(c_j) * PRODUCT_i P(w_i | c_j)

   return c*

Figure 8 Bayes Classifier Training and Classification Algorithms
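
By way of illustration only, the training and classification steps of Figure 8 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function names, the representation of an example as a (tokens, class value) pair, and the sample data are all hypothetical, and add-one smoothing is assumed for the conditional probability. The scores are summed in log space, which selects the same class as the product in Figure 8 while avoiding numeric underflow.

```python
import math
from collections import Counter

def train_bayes_classifier(example_set):
    """example_set: list of (tokens, class_value) pairs (hypothetical layout)."""
    attributes = {tok for tokens, _ in example_set for tok in tokens}   # A
    values = {label for _, label in example_set}                        # C
    priors, conditionals = {}, {}
    for c in values:
        docs_c = [tokens for tokens, label in example_set if label == c]
        priors[c] = len(docs_c) / len(example_set)                      # P(c_j)
        counts = Counter(tok for tokens in docs_c for tok in tokens)    # Text_j
        n = sum(counts.values())
        # Assumed add-one smoothing: P(w_i | c_j) = (n_k + 1) / (n + |A|)
        conditionals[c] = {w: (counts[w] + 1) / (n + len(attributes))
                           for w in attributes}
    return attributes, priors, conditionals

def classify_text(input_tokens, model):
    attributes, priors, conditionals = model
    # Keep only the word positions whose token is a known attribute.
    positions = [tok for tok in input_tokens if tok in attributes]

    def log_score(c):
        return math.log(priors[c]) + sum(math.log(conditionals[c][w])
                                         for w in positions)

    return max(priors, key=log_score)                                   # c*

# Hypothetical training data, for illustration only.
EXAMPLES = [
    (["wine", "vineyard", "oak"], "wine"),
    (["wine", "barrel"], "wine"),
    (["engine", "car", "wheel"], "auto"),
    (["car", "engine"], "auto"),
]

if __name__ == "__main__":
    model = train_bayes_classifier(EXAMPLES)
    print(classify_text(["wine", "oak", "unseen"], model))
```

Tokens that never occurred in training are simply ignored at classification time, which corresponds to the "positions" step of Classify_Text.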
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
CONCEPT-BASED SEARCH AND RETRIEVAL SYSTEM

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a concept-based search and retrieval
system. More particularly, the present invention relates to a system that
indexes collections of documents with ontology-based predicate structures
through automated and/or human-assisted methods. The system extracts the
concepts behind user queries to return only those documents that match
those concepts.

2. Background of the Invention

The Internet, which was created to keep a small group of scientists
informed, has now become so vast that it is no longer easy to find
information. Even the simplest attempt to find information results in data
overload. The Internet is a highly unorganized and unstructured repository
of data, whose growth rate is ever increasing. As the data grows, it
becomes more and more difficult to find.

Early pioneers in information retrieval from the Internet developed novel
approaches, which can be categorized in two main areas: automated keyword
indexing and manual document categorization. The large majority of current
search engines use both of these approaches. For example, the earliest
generation of search engines, including Lycos, Altavista, and Webcrawler,
as well as the most recent ones, such as Northern Light or FAST, are all
based on keyword indexing and searching. Another very popular search
engine, Yahoo!, is actually a categorized repository of documents that
have been manually categorized by human laborers.

Searching for information using the keyword approach requires the user to
input a set of words, which can range from a single word to a natural
language sentence. Normally, the input is parsed into an unstructured set
of keywords. The set of keywords is then matched against an inverted index
that links keywords with the documents in which they appear. Documents
with the most keywords that match the input query are retrieved. Some
ranking process generally follows this retrieval, and orders the returned
documents by how many times the query words appear within them. The
problem with this approach is that no attempt is made to identify the
meaning of the query and to compare that meaning with the meaning of the
documents. Therefore, there is a clear need to develop new systems that
can take this into consideration.

A second approach is manual document organization. A typical document
categorization search engine, Yahoo!, does not contain an inverted index,
but rather a classification of documents manually categorized in a
hierarchical list. When a user queries Yahoo!, a keyword-based search is
run against the words used to classify documents, rather than the
documents themselves. Every time the search engine capability is used, it
displays the location of the documents within the hierarchy. While this
approach is useful to users, so far as it means that other humans have
employed common sense to filter out documents that clearly do not match,
it is limited by two factors. The first factor is that it does not scale
to the number of documents now available on the web, as the directory only
can grow as quickly as human editors can read and classify pages. The
second factor is that it does not understand the meaning of the query, and
a document classified under a particular word will not be retrieved by a
query that uses a synonymous word, even though the intent is the same.

As a result, there is a pressing need to develop search engines that
bridge the gap between the meaning of an input query and pre-indexed
documents. Existing approaches will not solve this problem, because it is
impossible to determine the meaning of input queries from terms alone. A
successful approach must also make use of the structure of the query.
Ideally, documents and queries should both be mapped to a common logical
structure that permits direct comparison by meaning, not by keywords.

Previous generations of search engines have relied on a variety of
techniques for searching a database containing the text of the documents
being searched. Generally, an inverted index is created that permits
documents to be accessed on the basis of the words they contain. Methods
for retrieving documents and creating indexes include Monier's System for
adding a new entry to a web page table upon receiving web page including a
link to another web page not having a corresponding entry in a web page
table, as set forth in U.S. Pat. No. 5,974,455. Various schemes have been
proposed for ranking the results of such a search. For example, U.S. Pat.
No. 5,915,249 to Spencer sets forth a system and method for accelerated
query evaluation of very large full text databases, and U.S. Pat. No.
6,021,409, to Burrows discloses a method for parsing, indexing and
searching world-wide-web pages. These patents cover techniques for
creating full-text databases of content, usually world-wide-web pages, and
providing functionality to retrieve documents based on desired keywords.

Full-text databases of documents are generally used to serve keyword-based
search engines, where the user is presented with an interface such as a
web page, and can submit query words to the search engine. The search
engine contains an inverted index of documents, where each word is mapped
to a list of documents that contain it. The list of documents is filtered
according to some ranking algorithm before being returned to the user.
Ranking algorithms provided by full-text, keyword-based search engines
generally compute document scores based upon the frequency of the term
within the document, where more mentions yield a higher score, as well as
its position, earlier mentions leading to a higher score. The three
patents discussed above are all typical representations of the prior art
in text retrieval and indexing without natural language processing.

There has been substantial research in search technology directed towards
the goal of imposing structure on both data and queries. Several previous
systems, such as set forth in U.S. Pat. Nos. 5,309,359 and 5,404,295, deal
with manual or semi-automatic annotation of data so as to impose a
structure for queries to be matched to. In U.S. Pat. No. 5,309,359 to
Katz, a process by which human operators select subdivisions of text to be
annotated, and then tag them with questions in a natural language, is
presented. These questions are then converted automatically into a
structured form by means of a parser, using concept-relation-concept
triples known as T-expressions. While the process of T-expression
generation is automatic, the selection of text to annotate with such
expressions is manual or semi-automatic. Furthermore, systems such as Katz
provide only for encoding of questions, not for encoding of the documents
themselves.

Another approach is set forth in Liddy et al., U.S. Pat. No. 5,963,940,
which discloses a natural-language information retrieval system. The
system provides for parsing of a user's query into a logical form, which
may include complex nominals, proper nouns, single terms, text structure,
and logical make-up of the query, including mandatory terms. The
alternative representation is matched against documents in a database
similar to that of the systems described
previously. However, the database does not contain a traditional inverted
index linking keywords to the documents in which they appear, but rather
an annotated representation of the documents in the same form as the query
representation. The documents are indexed by a modular system that
performs staged processing of documents, with each module adding a
meaningful annotation to the text. On the whole, the system generates both
conceptual and term-based representations of the documents and queries.
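
For contrast with the annotated representations just described, the traditional inverted index and frequency-based ranking referred to in this background can be sketched as follows. This is an illustrative sketch only: the function names, the whitespace tokenization, and the sample documents are assumptions and are not taken from any of the cited patents.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(query, documents, index):
    """Return ids of documents containing any query word, most mentions first."""
    keywords = query.lower().split()
    candidates = set()
    for word in keywords:
        candidates |= index.get(word, set())

    def score(doc_id):
        words = documents[doc_id].lower().split()
        return sum(words.count(word) for word in keywords)

    # Sort by descending score, then by id so the order is deterministic.
    return sorted(candidates, key=lambda doc_id: (-score(doc_id), doc_id))

# Hypothetical documents, for illustration only.
DOCUMENTS = {
    "d1": "the car is fast",
    "d2": "the automobile is fast and the automobile is red",
    "d3": "wine aged in french oak",
}

if __name__ == "__main__":
    index = build_inverted_index(DOCUMENTS)
    print(keyword_search("fast automobile", DOCUMENTS, index))
    print(keyword_search("car", DOCUMENTS, index))
```

In this sketch a query for "car" never reaches the document that only says "automobile", which is the synonym limitation of keyword matching that the background section describes.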

In U.S. Pat. No. 5,873,056, Liddy et al. additionally
discloses a system that accounts for lexical ambiguity based 
on the fact that words generally have different meanings 
across multiple domains. The system uses codes to represent 
the various domains of human knowledge; such codes are 
taken from a lexical database, machine-readable dictionary, 
or other semantic networks. The system requires previous 
training on a corpus of text tagged with subject field codes, 
in order to learn the correlations between the appearance of 
different subject field codes. Once such training has been 
performed, a semantic vector can be produced for any new 
document that the system encounters. This vector is said to
be a text level semantic representation of a document rather 
than a representation of every word in the document. Using 
the disambiguation algorithm, the semantic vectors pro- 
duced by the system are said to accommodate the problem 
that frequently used words in natural language tend to have 
many senses and therefore, many subject codes. 

In U.S. Pat. No. 6,006,221, Liddy et al. further discloses
a system that extends the above functionality to provide 
cross-lingual information retrieval capability. The system 
relies on a database of documents subject to the processing 
discussed above, but further extends the subject field coding 
by applying it to a plurality of languages. This system 
includes part-of-speech tagging to assist in concept 
disambiguation, which is an optional step in the previously 
discussed system. Information retrieval is performed by a 
plurality of statistical techniques, including term frequency, 
index-document-frequency scoring, Pearson moment corre- 
lation products, n-gram probability scoring, and clustering 
algorithms. In the Liddy et al. system, clustering provides
the needed capability to perform visualization of result sets 
and to graphically modify queries to provide feedback based 
on result set quality. 

Another approach to natural-language processing for 
information retrieval is set forth in U.S. Pat. No. 5,794,050, 
to Dahlgren et al. Dahlgren et al. discloses a naive semantic 
system that incorporates modules for text processing based 
upon parsing, formal semantics and discourse coherence, as 
well as relying on a naive semantic lexicon that stores word 
meanings in terms of a hierarchical semantic network. Naive 
semantics is used to reduce the decision spaces of the other 
components of the natural language understanding system of 
Dahlgren et al. According to Dahlgren et al., naive semantics
is used at every structure building step to avoid combina- 
torial explosion. 

For example, the sentence "face places with arms down" 
has many available syntactic parses. The word "face" could 
be either a noun or a verb, as could the word "places".
However, by determining that "with arms down" is statis- 
tically most likely to be a prepositional phrase which 
attaches to a verb, the possibility that both words are nouns 
can be eliminated. Furthermore, the noun sense of "face" is 
eliminated by the fact that "with arms down" includes the 
concepts of position and body, and one sense of the verb 
"face" matches that conception. In addition to the naive 
semantic lexicon, a formal semantics module is incorporated, which
permits sentences to be evaluated for truth conditions with respect to a
model built by the coherence module. Coherence permits the resolution of
causality, exemplification, goal, and enablement relationships. This is
similar to the normal functionality of knowledge bases, and Dahlgren et
al. claim that their knowledge is completely represented in first order
logic for fast deductive methods.

Natural language retrieval is performed by Dahlgren et al.'s system using
a two-stage process referred to as digestion and search. In the digestion
process, textual information is input into the natural language
understanding (NLU) module, and the NLU module generates a cognitive model
of the input text. In other words, a query in natural language is parsed
into the representation format of first-order logic and the previously
described naive semantics. The cognitive model is then passed to a search
engine that uses two passes: a high-recall statistical retrieval module
using unspecified statistical techniques to produce a long list of
candidate documents; and a relevance reasoning module which uses
first-order theorem proving and human-like reasoning to determine which
documents should be presented to the user.

U.S. Pat. No. 5,933,822, to Braden-Harder et al., provides yet another
natural language search capability that imposes logical structure on
otherwise unformatted, unstructured text. The system parses the output
from a conventional search engine. The parsing process produces a set of
directed, acyclic graphs corresponding to the logical form of the
sentence. The graphs are then re-parsed into logical form triples similar
to the T-expressions set forth in Katz. Unlike the logical forms set forth
in Katz or Dahlgren et al., the triples express pairs of words and the
grammatical relation which they share in a sentence. As an example, the
sentence "the octopus has three hearts" produces logical form triples
"have-Dsub-octopus", "have-Dobj-heart", and "heart-Ops-three". These
triples encode the information that octopus is the subject of have, heart
is the object of have, and three modifies heart.

The Braden-Harder et al. system provides a mechanism for the retrieval and
ranking of documents containing these logical triples. According to the
patent, once the set of logical form triples has been constructed and
fully stored, both for the query and for each of the retrieved documents
in the output document set, a functional block compares each of the
logical form triples for each of the retrieved documents to locate a match
between any triple in the query and any triple in any of the documents.
The various grammatical relationships discussed previously are assigned
numerical weights, and documents are ranked by the occurrence of those
relations between the content words. The presence of the content words is
not incorporated into the ranking algorithm independently of their
presence within logical triples matching the query triples. As a result,
the Braden-Harder et al. system replaces a keyword search based upon
individual lexical items with a keyword search based upon logical triples.
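
The triple matching and weighted ranking described above can be illustrated with the following sketch. It is not the Braden-Harder et al. implementation: the `Triple` type, the function names, and in particular the numerical weights are hypothetical values chosen only to make the example concrete.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A logical form triple such as have-Dsub-octopus."""
    head: str
    relation: str
    dependent: str

# Hypothetical weights; the text states only that relations are weighted.
RELATION_WEIGHTS = {"Dsub": 3.0, "Dobj": 2.0, "Ops": 1.0}

def score_document(query_triples, doc_triples):
    """Sum the weights of query triples that also occur in the document."""
    doc_set = set(doc_triples)
    return sum(RELATION_WEIGHTS.get(t.relation, 0.0)
               for t in set(query_triples) if t in doc_set)

def rank_documents(query_triples, docs):
    """Return ids of documents with at least one matching triple, best first."""
    scored = [(score_document(query_triples, triples), doc_id)
              for doc_id, triples in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

# Triples for "the octopus has three hearts", as given in the text.
OCTOPUS = [
    Triple("have", "Dsub", "octopus"),
    Triple("have", "Dobj", "heart"),
    Triple("heart", "Ops", "three"),
]

# Hypothetical document set, for illustration only.
DOCS = {
    "a": OCTOPUS,
    "b": [Triple("have", "Dsub", "octopus"), Triple("have", "Dobj", "arm")],
    "c": [Triple("swim", "Dsub", "octopus")],
}

if __name__ == "__main__":
    print(rank_documents(OCTOPUS, DOCS))
```

Document "c" mentions the content word "octopus" but shares no triple with the query, so it receives no score, reflecting the statement that content words do not contribute to the ranking independently of matching triples.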

U.S. Pat. No. 5,694,523 to Wical discloses a content processing system
that relies on ontologies and a detailed computational grammar with
approximately 210 grammatical objects. The Wical system uses a two-level
ontology called a knowledge catalog, and incorporates both static and
dynamic components. The static component contains multiple knowledge
concepts for a particular area of knowledge, and stores all senses for
each word and concept. However, it does not contain concepts that are
extremely volatile. Instead, the dynamic component contains words and
concepts that are inferred to be related to the content of the static
component. Such an inference is accomplished through multiple statistical
methods.
The Wical system is further described in U.S. Pat. No. 5,940,821. An
example is given therein stating that a document about wine may include
the words "vineyards", "Chardonnay", "barrel fermented", and "French oak",
which are all words associated with wine. These words are then weighted
according to the number of times they occur in the wine context within the
body of the documents processed by the Wical system, with one distance
point or weight for each one hundred linguistic, semantic, or usage
associations identified during processing. As a result, the system of
Wical automatically builds extensions to the core ontology by scoring
words that frequently appear in the context of known concepts as probably
related concepts. The scoring algorithm of Wical is fairly conservative,
and should generally produce reliable results over large corpuses of data.

The Wical system produces a set of so-called theme vectors for a document
via a multi-stage process that makes use of the forgoing knowledge
catalog. The system includes a chaos processor that receives the input
discourse, and generates the grammatical structured output. Such
grammatical structured output includes identifying the various parts of
speech, and ascertaining how the words, clauses, and phrases in a sentence
relate to one another.

Consequently the Wical system produces not only word-level part-of-speech
categorization (i.e., noun, verb, adjective, etc.) but also relations such
as subject and object. The output of the chaos processor is then passed to
a theme parser processor that discriminates the importance of the meaning
and content of the text on the basis that all words in a text have varying
degrees of importance, some carrying grammatical information, and others
carrying meaning and content. After the theme parser processor has
generated this information, it is considered to be theme-structured
output, which may be used for three distinct purposes. One purpose is
providing the topics of the discourse in a topic extractor. A second
purpose is generating summarized versions of the discourse in a kernel
generator. The third purpose is identifying the key content of the
discourse in a content extractor. The forgoing steps are performed in
parallel, and require additional processing of the theme-structured output
in order to generate textual summaries, or graphical views of the concepts
within a document. Such an output may be used in a knowledge-based system
that identifies both documents and concepts of interest with regard to the
inquiry of the user, and a research paper generation application that
provides summaries of documents relevant to a query, as produced by the
kernel generator set forth previously.

Ausborn, U.S. Pat. No. 5,056,021, discloses a simpler technique for
creating searchable conceptual structures. The technique of Ausborn uses a
database of words organized into levels of abstraction, with concepts
arranged as clusters of related words. The levels of abstraction are
implemented in thesauri, and are equivalent to hierarchical levels of an
ontology. The system of Ausborn serves as a parser to directly compile
thesaurus entries into a cluster representing cluster meaning. Clusters
are then pruned by virtue of a lack of common ontological features among
the words in the sentence. Sentences whose words do not have similar
meaning at equal levels of abstraction are judged as erroneous parses.
These structures can then be searched in a manner equivalent to the
technique set forth in the Braden-Harder patent.

Some efforts have been made towards expanding queries. For example, U.S.
Pat. No. 5,721,902, to Schultz, discloses a technique employing hidden
Markov models to determine the part of speech of words in a sentence or
sentence fragment. Once the part of speech has been selected, the word is
applied to a sentence network to determine the expansion words
corresponding to the query term. For a given query word, only those
expansion words from the semantic network that are of the same part of
speech are added to the terms in the natural language query. If a query
term is a proper noun, other terms in the semantic network are not
activated, even those that are also nouns, as the terms are unlikely to be
similar. Schultz further discloses a relevance-scoring algorithm, which
compares the query terms to the text information fields that serve as
metadata within an information retrieval system. The Schultz system also
discloses techniques for preparing and loading documents and multimedia
into the information retrieval system. However, such techniques do not
involve manipulation or reparsing of the documents and do not constitute
an advance on any of the previously discussed indexing systems.

The concept-based indexing and search system of the present invention has
distinct advantages over the approach set forth in the Katz and the other
previously set forth patents. The Katz method tags segments of the text
with formal representations of specific questions that the text represents
answers to. While such an approach guarantees that a question will be
answered if the question has been previously asked, the process is limited
by the efficiency of the tagging system. The Katz system can provide fully
automatic tagging of text. However, the implementation of a tagging system
that can automatically generate appropriate questions for each segment of
text requires sophisticated machine reasoning capabilities, which do not
yet exist.

SUMMARY OF THE INVENTION

The foregoing and other deficiencies are addressed by the present
invention, which is directed to a concept-based indexing and search
system. More particularly, the present invention relates to a system that
indexes collections of documents with ontology-based predicate structures
through automated and/or human-assisted methods. The system extracts the
concepts behind user queries to return only those documents that match
those concepts.

The concept-based indexing and search system of the present invention has
a number of advantages over the conventional systems discussed previously.
These advantages fall into two categories: improvements in the precision
of information retrieval, and improvements in the user interface.

The concept-based indexing and search system of the present invention can
utilize any of the previously discussed systems to collect documents and
build indices. An advantage of the present invention over the conventional
systems is in the area of retrieval and ranking of indexed documents.

The concept-based indexing and search system of the present invention is
an improvement over the Katz system in that it transforms the text into a
formal representation that matches a variety of possible questions.
Whereas the Katz system requires the questions to be known in advance,
even if automatically generated, the present invention does not require
prior knowledge of the questions. As a result, the present invention
provides significant improvements in scalability and coverage.

The present concept-based indexing and search system also presents an
advantage to the information retrieval systems of Liddy et al. The
monolingual implementation of Liddy et al. constructs vector
representations of document content, with vectors containing complex
nominals, proper nouns, text structure, and logical make-up of the query.
The logical structure provided is equivalent to first-order predi-
cate calculus. Implementations of the system of Liddy et al. have been
used to provide input to machine reasoning systems. The Liddy et al.
system makes further provisions for subject codes, used to tag the domain
of human knowledge that a word represents. The subject codes are used to
train statistical algorithms to categorize documents based on the
co-occurrence of particular words, and corresponding subject codes. The
resulting system is a text level semantic representation of a document
rather than a representation of each and every word in the document.

The present system imposes a logical structure on text, and a semantic
representation is the form used for storage. The present system further
provides logical representations for all content in documents. The
advantages of the present system are the provision of a semantic
representation of comparable utility with significantly reduced processing
requirements, and no need to train the system to produce semantic
representations of text content. While training is needed to enable
document categorization in the present system, which improves the
precision of retrieval, generation of the semantic representation is
independent of the categorization algorithm.

The concept based search engine of the present invention also presents
advantages over Dahlgren et al.'s system, embodied in U.S. Pat. No.
5,794,050. The Dahlgren system uses a semantic network similar to the
ontologies employed in the system of present invention. However, it relies
on a complicated grammatical system for the generation of formal
structures, where complicated grammatical information is needed to
eliminate possible choices in the parser. The concept based search engine
system of the present invention provides an advantage in that it uses a
simple grammatical system in which rule probabilities and conflicting
ontological descriptions are used to resolve the possible syntactic parses
of sentences. This greatly reduces the processing power required to index
documents.

From the foregoing, it is an object of the present invention to provide a
concept based search and retrieval system having improved functionality
over conventional search and retrieval systems with equivalent efficiency
in returning web pages.

Another object of the present invention is to provide a concept-based
search and retrieval system that comprehends the intent behind a query
from a user, and returns results matching that intent.

Still another object of the present invention is to provide a
concept-based search that can perform off-line searches for unanswered
user queries and notify the user when a match is found.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other attributes of the present invention will be described with
respect to the following drawings in which:

FIG. 1 is a block diagram of the concept based search and retrieval system
according to the present invention;

FIG. 2 is a block diagram of the query ontological parser according to the
present invention;

FIG. 3 is a block diagram of a Bayes classifier according to one variation
of the present invention;

FIG. 4 is a block diagram of the sentence lexer according to one variation
of the present invention;

FIG. 5 is a block diagram of the parser according to one variation of the
present invention;

FIG. 6 is a block diagram of the Bayes classifier collection stage
according to one variation of the present invention;

FIG. 7 is a block diagram of the Bayes classifier training stage according
to one variation of the present invention;

FIG. 8 is a diagram illustrating the Bayes training and classifier
algorithms according to one variation of the present invention;

FIG. 9 is a diagram illustrating the Bayes classifier document
classification process according to one variation of the present
invention;

FIG. 10 is a block diagram of the Bayes classifier query classification
according to one variation of the present invention;

FIG. 11 is a diagram of a parser tree according to one variation of the
present invention;

FIG. 12 is a flow chart of an example of classification collection flow
according to one variation of the present invention;

FIG. 13 is a flow chart of an example of classification training flow
according to one variation of the present invention;

FIG. 14 is a flow chart illustrating an example of the classification
process according to one variation of the present invention;

FIG. 15 is a flow chart of query topic example set generation according to
one variation of the present invention;

FIG. 16 is a flow chart of query topic classification training according
to one variation of the present invention; and

FIG. 17 is a flow chart of a trained query topic classifier identification
of an input query according to one variation of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed discussion of the present invention, numerous
terms, specific to the subject matter of a system and method for
concept-based searching, are used. In order to provide complete
understanding of the present invention, the meaning of these terms is set
forth below as follows:

The term concept as used herein means an abstract formal representation of
meaning, which corresponds to multiple generic or specific words in
multiple languages. Concepts may represent the meanings of individual
words or phrases, or the meanings of entire sentences. The term predicate
means a concept that defines an n-ary relationship between other concepts.
A predicate structure is a data type that includes a predicate and
multiple additional concepts; as a grouping of concepts, it is itself a
concept. An ontology is a hierarchically organized complex data structure
that provides a context for the lexical meaning of concepts. An ontology
may contain both individual concepts and predicate structures.

The present system and method for concept-based searching is
distinguishable from an ontology-based search system. A purely
ontology-based search system would expand queries from particular words to
include synonyms, instances, and parent concepts (e.g., submarine is a
synonym [...], IBM is an instance of a [...], and [...] is a parent
concept of automobile). However, such an ontology-based search system
would only search for documents containing other words that are defined by
the ontology to be related to the query. On the other hand, a method and
system for concept-based searching according to the present invention has
the capabilities of an ontology-based
search system and, in addition, can search for logically structured
groupings of items from the ontology.
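
The terms defined above (concept, predicate, predicate structure) and the difference between matching related words and matching logically structured groupings can be sketched as follows. This is a hypothetical illustration only: the class names, the `concepts_in` helper, and the "dog bites man" example are assumptions introduced here and do not describe the data structures of the invention.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Concept:
    """An abstract representation of meaning, identified here by a name."""
    name: str

@dataclass(frozen=True)
class PredicateStructure:
    """A predicate plus its argument concepts; usable wherever a concept is."""
    predicate: Concept
    arguments: Tuple["Node", ...]

Node = Union[Concept, PredicateStructure]

def concepts_in(node):
    """Flatten a node into the unordered set of concepts it mentions."""
    if isinstance(node, Concept):
        return {node}
    found = {node.predicate}
    for argument in node.arguments:
        found |= concepts_in(argument)
    return found

DOG, MAN, BITE = Concept("dog"), Concept("man"), Concept("bite")

# Hypothetical indexed document: "dog bites man".
DOCUMENT = PredicateStructure(BITE, (DOG, MAN))
QUERY_SAME = PredicateStructure(BITE, (DOG, MAN))
QUERY_REVERSED = PredicateStructure(BITE, (MAN, DOG))

if __name__ == "__main__":
    # A bag-of-concepts comparison cannot tell the two queries apart.
    print(concepts_in(DOCUMENT) == concepts_in(QUERY_REVERSED))
    # A structured comparison can.
    print(DOCUMENT == QUERY_SAME, DOCUMENT == QUERY_REVERSED)
```

Because a predicate structure may itself appear as an argument, the sketch also reflects the statement that a grouping of concepts is itself a concept.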

Referring to FIG. 1, the concept-based indexing and 
search system 100 of the present invention includes a user 
interface 110, a query ontological parser 120, a Bayes
classifier 130, a document ontological parser 140, a data 
repository 150, a persistent agent 160, a comparison and 
ranking module 170, and a document clustering component 
180. 

The user interface 110 provides a means for both system engineers and end
users to interact with the system. The system engineering interface 112 of
the user interface 110 gives engineers the ability to maintain, monitor
and control system operations. There are two methods of monitoring the
system. The first method uses a graphical user interface (GUI) that
displays current tasks in progress, user logs, and other types of
statistical information. The second method of monitoring the system is
through event-driven notification. The system alerts an engineer of
particular events or irregularities in system operation that may require
immediate attention. The engineers receive notification of these events
via short messaging service to PCS wireless phones or e-mail to their
desktops. To fine tune and control the system, the graphical user
interface will provide engineers methods for managing internal processes
and system properties. To train the system further, engineers will be able
to input queries using the same graphical user interface.

The end user interface 114 provides a clear and simple 
graphical user interface that allows users to submit queries 
to the system. The end user interface 114 is responsible for
creating a secure connection to the system for query. Once
connected, the end user interface 114 will register with the
system to receive results for the user query. Once the query
is processed, the end user interface 114 will format and
present the results to the user in a logical manner.

In order to support different types of users to the system, 
the user interface 110 can be implemented in whatever 
fashion is most desirable for the user. For example, web 
browser users can submit queries to the system via an 
HTML web site utilizing the Hyper Text Transfer Protocol
(HTTP). Application-level connections may use the concept
based search engine through standards such as CORBA or
Java RMI. In order to provide such flexibility, the system
will rely on plug-and-play communication components that
will translate information between the client and server
modules. These communication components allow for quick 
integration between completely separate systems. 

The query ontological parser 120 is a component of the 
system, which transforms user queries entered in natural 
language into predicates, the formal representation system
used within the concept based search and retrieval system of 
the present invention. While the query ontological parser 
120 is optimized for parsing user queries, it is identical to the 
document parser 140 discussed in detail below. 

As shown in FIG. 2, the query ontological parser 120 has
five components, namely a sentence lexer 122, post-lexer
filters 123, a parser 124, post-parser filters 125, and an
ontology 128. The sentence lexer 122 transforms input
sentences into part-of-speech-tagged instances of concepts
from the ontology 128. Any ontology 128 may be used, as
may any part-of-speech tagging algorithm. Multiple
sequences of ontological concepts may be produced by the
sentence lexer 122. Consequently, post-lexer filters 123
are employed to prune out some of the sequences based on
rules about sequences of syntactic tags.

The parser 124 creates syntactic tree structures that represent
the grammatical relations between the ontological
concepts, based on the syntactic tags attached to the concepts.
The tree structures are created through the use of a
context-free grammar, and may be implemented through a 
variety of techniques. Post-parser filters 125 are used to 
eliminate parse trees based on rules about improbable syn- 
tactic structures, and rules about conflicting ontological 
specifications. 

The ontology 128 is a lexical resource that provides
information about words, including both their possible
syntactic uses and their meaning. WordNet™ is used in the
example embodiment discussed below; however, any ontol- 
ogy 128 may be used. 

The sole function of the query ontological parser 120 is to 
use the five components to create predicate structures, which 
are then used as keys to search the data repository for
documents which match the query of the user. 

The Bayes classifier 130, shown in FIG. 1, is a tool for 
probabilistically classifying documents and queries by the 
users. The Bayes classifier 130 is an implementation of the 
known naive Bayes classifier approach for classifying text, 
from which the system can build query-topic-specific and
document-domain-specific classifiers. A document-domain- 
specific Bayes classifier uses the words that make up a 
particular concept as features to determine if a particular 
document belongs to a specific domain. A query-topic- 
specific Bayes classifier uses the words that make up a 
particular question as the features it uses to identify the topic 
of the question. The primary reason for using a Bayes
classifier 130 is that it is a rapid, effective technique for
reducing the number of documents to be searched in 
response to a query. The indexing, parsing and searching 
time of the system is thereby dramatically reduced. 

Referring to FIG. 3, the Bayes classifier 130 has two main 
components, a learner 132 and a reasoner 134. The learner 
132 is responsible for training the classifier. Before the 
classifier 130 is trained, it is totally naive and cannot 
properly classify text. During the training process, the 
learner 132 obtains knowledge from a set of example data 
called the example set, and generates a set of trained 
document examples called the training set. 

The reasoner 134 is the question-answering component of 
the Bayes classifier 130. The reasoner 134 is responsible for 
determining the probability that each pre-classified document
is the correct answer to a given question. The reasoner
134 makes a decision based on the knowledge acquired 
during the learning process. 

In general, the Bayes classifier 130 will perform a learning
process and an answering process. The learning
process uses a set of hand-classified documents to train the
classifier 130 from an original naive state to one in which it
can correctly classify new documents. The reasoner 134
answering process has two types of output, based on the
input to the classifier 130. First, the trained classifier 130 
accepts a new document as input and calculates the prob- 
ability that the input document belongs to the specific 
domain it is trained to classify. In the second mode, the 
trained classifier 130 accepts a query and calculates the 
probability that the query belongs to the topic the classifier 
130 is trained on. The resulting probabilities in either case 
determine the response from the reasoner 134. 

Returning to FIG. 1, the document ontological parser 140
is used by the concept-based search and retrieval system to
transform documents into predicate structures for storage in
the data repository. The document ontological parser 140 is
one of two versions of the ontological parser used by the
concept-based search and retrieval system 100. The document
ontological parser 140 is optimized for the grammatical
structure of documents meant to be indexed, but is
otherwise identical to the query ontological parser 120.

The document ontological parser 140 contains the same 
components as the query ontological parser 120; however,
the grammar for the document ontological parser 140 is
written for declarative sentences, which are the type usually
found in documents. The document ontological parser 140
receives documents and passes predicate libraries to the data
repository 150. The document ontological parser 140 is only
used in the indexing process, while the query ontological 
parser 120 is only used during retrieval. 

The data repository 150 manages data persistence and 
provides methods for data storage and retrieval. The data 
repository 150 hides the storage implementation details from
the rest of the system by providing a general set of methods
and interfaces for data access. The object-oriented design of 
the data repository 150 allows for different types of under- 
lying storage systems to be utilized. Possible storage imple- 
mentations may consist of a relational or object oriented 
database. Databases are generally used to store massive 
quantities of data generated by search engine indexes. By 
abstracting the underlying storage interface, a variety of 
solutions can be implemented without modification of other 
components. 
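
By way of illustration only, the storage abstraction described above might be sketched in Python as follows. The names `DataRepository`, `InMemoryRepository`, `store` and `retrieve` are hypothetical and do not appear in this specification; the in-memory dictionary merely stands in for a relational or object-oriented database.

```python
from abc import ABC, abstractmethod


class DataRepository(ABC):
    """Hypothetical general interface; callers never see the storage backend."""

    @abstractmethod
    def store(self, kind, key, value):
        """Persist `value` under `key` within a data category such as 'bayes'."""

    @abstractmethod
    def retrieve(self, kind, key):
        """Return the stored value, or None if nothing was stored."""


class InMemoryRepository(DataRepository):
    """Assumed stand-in backend: one dictionary per data category."""

    def __init__(self):
        self._tables = {}

    def store(self, kind, key, value):
        self._tables.setdefault(kind, {})[key] = value

    def retrieve(self, kind, key):
        return self._tables.get(kind, {}).get(key)
```

Under this sketch, a different backend could be substituted by writing another subclass of `DataRepository`, without modifying the components that call `store` and `retrieve`.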

The function of the data repository 150 is to provide 
uncomplicated access of complex data structures, through 
encapsulation of the current method of data storage being 
implemented. The data repository 150 stores three different 
types of data: result data, Bayes data and ontology data. With 
regards to system initialization, the data repository 150 
provides access to stored data for the Bayes classifier 130 
and ontological parser 140. In order for the Bayes classifier 
130 to filter documents, it must retrieve training set data 
stored in the data repository 150. When domain specific 
classifiers are retrained, the new training set data is stored in 
the data repository 150 in two ways: for retrieving ontology 
data, and for storing and retrieving predicate library struc- 
tures. As training queries are processed, the query predicate 
libraries generated by the query ontological parser 120 and 
matching document predicate libraries from the document 
ontological parser 140 are stored together within the data 
repository 150 for later user query. 

The concept-based search and retrieval system will sometimes
fail to answer a query from predicate-based data
repositories 150. The data repository 150 might not have
enough indexed documents at the time of the search. A
persistent agent-based approach takes advantage of unanswered
queries to index new documents capable of answering
later queries about the same subject or to notify the user
of the original query when such a document is found.

For example, a query may be: "Does Botswana possess 
biological weapons?" If the comparison and sorting algorithm
does not find any documents with predicate structures
indicating that Botswana has biological weapons in the data 
repository 150, the answer returned may be: no documents 
found. However, the comparison and sorting algorithm may 
determine that some documents containing related informa- 
tion exist. The algorithm may produce documents about 
another country from the same region having biological 
weapons, or information about Botswana having other types 
of weapons. 

When the query answer is "no documents found," the 
system can assume that it failed to answer the question. 
Furthermore, when the system returns any document, the 
user has the opportunity to provide input as to the accuracy
of the match between the documents provided and the
intention behind the query of the user. Thus, even if the 
system provides some matching documents, the user might 
decide that the system 100 failed to answer the question 
appropriately by clicking on a button or checking a box in 
the interface 114. Based on such feedback, the persistent 
agent can attempt to find documents that could answer 
similar queries in the future by creating persistent agents that 
attempt to find ontological predicate structures that match 
the query. 

The persistent agent maintains a collection of predicate 
structures extracted from the query. All new documents in 
the data repository 150 are searched, or all newly located 
documents parsed by the document ontological parser 140 
are monitored. Once a document is found that closely 
matches the query, a notification is sent to the user. In 
addition, the document is indexed accordingly and placed in 
the data repository. 
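
The monitoring behavior described above might be sketched as follows, purely as an illustration. The class `PersistentAgent`, the tuple representation of predicate structures, and the overlap `threshold` used to decide that a document "closely matches" are all assumptions; the specification does not define how closeness is measured.

```python
class PersistentAgent:
    """Hypothetical agent that watches new documents on behalf of one query."""

    def __init__(self, user, query_predicates, threshold=0.5):
        self.user = user
        self.query_predicates = set(query_predicates)
        self.threshold = threshold      # assumed closeness cut-off
        self.notifications = []         # (user, document id) pairs to deliver

    def observe(self, doc_id, doc_predicates):
        """Check one newly parsed document; record a notification on a match."""
        if not self.query_predicates:
            return False
        matched = self.query_predicates & set(doc_predicates)
        score = len(matched) / len(self.query_predicates)
        if score >= self.threshold:
            self.notifications.append((self.user, doc_id))
            return True
        return False
```

In this sketch each predicate structure is a hashable tuple such as `("possess", "botswana", "biological weapons")`, and `observe` would be called once for each document emitted by the document ontological parser 140.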

The concept based search and retrieval system 100 may
also fail to produce a valid response if the query is formulated
in terms not previously included in the ontology. In
such an instance, the persistent agent notifies a knowledge
engineer that a query sent by a user might contain a new
concept, and waits for confirmation by the engineer that the
query contains a new concept, whereupon the new concept is added to the
ontology. The persistent agent then monitors the new
ments as they are indexed by the document ontological 
parser 140. 

The next component of the system is the comparison and 
ranking module 170. The basic premise of relevancy search- 
ing is that results are sorted, or ranked according to certain 
criteria. The system provides a comparison and ranking 
algorithm, described below, to determine the similarity 
between a query from a user and a document, and rank each 
document based upon a set of criteria. Since the concept 
based search and retrieval system 100 will break down each 
natural language text into a set of predicates, the documents 
are represented as a predicate library. A user query is
converted to one or more predicates. The ranking and 
comparison algorithm is designed to rank the similarity 
between two predicate structures. 

In order to determine precisely the information conveyed
by the predicate, the algorithm implements a modifier strategy
to adjust the weight for each factor that modifies the
information conveyed to the system. The concept based
search and retrieval system 100 defines 13 different types of 
modifiers with which to adjust the weight of each factor, as 
shown in Table 1 below. 

TABLE 1

Modifier Name                           Explanation of Modifier

VerbOnlyMatchModifier                   Weight for matching two predicate structures' verb parts
NounCountModifier                       Weight for matching two predicate structures' argument parts
VerbNounMatchModifier                   Weight for matching complete predicate structures
PredicateStructureExactMatchModifier    Weight for two exactly matched predicate structures
PredicateExactMatchModifier             Weight for two exactly matched predicates
ConceptExactMatchModifier               Weight for two exactly matched concepts
ConceptProximityModifier                Weight that considers the ontological relationship between two concepts
SameStemModifier                        Weight for two words that are from the same stem
ArgumentMatchModifier                   Weight for two arguments that match exactly
ProperNounExactMatchModifier            Weight for two exactly matched proper nouns
SymbolMatchModifier                     Weight for two matched symbols
FrontLineModifier                       Weight for a predicate in which the corresponding sentence occurs within the first 30 sentences of the document
DocSizeModifier                         Weight for adjusting for document size

The document clustering component 180 provides one 
final level of filtering accumulated documents against the 
user's query. This ensures that users receive the optimal set 
of matching documents with minimal extraneous or irrel- 
evant documents included. 

On occasion, the best-fit set of document matches for a 
particular query will contain an excessively large number of 
documents. This can happen, for example, when the user's 
query is too general to enable the system to eliminate enough 
documents. Under these circumstances, the document clus- 
tering component 180 provides an intelligent, adaptive filter 
to focus user attention on those documents most likely to 
meet user needs and interests. 

The document clustering component 180 uses adaptive 
self-organizing feature map technology to appropriately and 
dynamically cluster the documents returned by the system as 
matches for the concepts in the user's query. The document 
clustering component 180 processes the set of documents 
proposed as matches based, not on individual concepts listed 
within those documents, but rather on the basis of patterns 
of concepts throughout the document. These complex con- 
cept patterns are used as the inputs to a self-adaptive feature 
map, which then automatically creates a cluster model that 
represents the statistical probability distribution of the docu- 
ments in the proposed matching set. A sample document 
from the matching set is shown to the user, who can specify 
that the example is either "similar to the desired documents"
or "not very similar to the desired documents." 

If the user declares that the example is similar to the 
documents desired, the document clustering component 180 
immediately returns those other documents within the over- 
all set that most closely cluster around the example docu- 
ment. If the user declares that the example is not similar to 
the documents desired, the document clustering component 
180 presents another example, this one chosen from a 
document cluster far from the first example's cluster. The 
user may again decide if this new sample is "similar to" or 
"not similar to" the desired documents. 

In essence, the user, within a very few examples, often as 
few as one or two, partitions the space of matching docu- 
ments into more refined categories. This partitioning is 
based on the global pattern of concepts across the entire 
document, as opposed to the individual concepts used in the 
other stages of processing within the system. This is, in 
effect, a higher-level meta-search through concept-pattern 
space, providing greater discrimination and more refined 
selection capabilities. 

In addition to this mode, the document clustering com- 
ponent 180 also pre-clusters those documents stored in the 
data repository, so that they can be more easily and effi- 
ciently processed to generate and refine query matches. 

Ontological parsing is a grammatical analysis technique
built on the proposition that the most useful information
which can be extracted from a sentence is the set of
concepts within it, as well as their formal relations to each
other. It derives its power from the use of ontologies to
situate words within the context of their meaning, and from
the fact that it does not need to find the correct purely
syntactic analysis of the structure of a sentence in order to
produce the correct analysis of the sentence's meaning.

The ontological parser is a tool which transforms natural-language
sentences into predicate structures. Predicate structures
are representations of logical relationships between the
words in a sentence. Every predicate structure contains a
predicate, which is either a verb or a preposition, and a set
of arguments, which may be any part of speech. Predicates
are words which not only have intrinsic meaning of their
own, but which also provide logical relations between other
concepts in a sentence. Those other concepts are the arguments
of the predicate, and are generally nouns, because
predicate relationships are usually between entities.
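
For illustration only, a predicate structure as just described might be represented by the following Python data type. The class name `PredicateStructure` and its field names are hypothetical. A tuple is used for the arguments so that their order is preserved, which is a design choice of this sketch rather than a requirement stated in the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PredicateStructure:
    """Hypothetical container: one predicate plus its arguments."""

    predicate: str
    predicate_pos: str            # part of speech of the predicate
    arguments: Tuple[str, ...]    # arguments may be any part of speech

    def __post_init__(self):
        # The text states that a predicate is either a verb or a preposition.
        if self.predicate_pos not in ("verb", "preposition"):
            raise ValueError("predicate must be a verb or a preposition")
```

Because the type is frozen and its fields are hashable, instances can be stored in sets or used as dictionary keys, for example when a predicate library is searched.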

The ontological parser 120 contains two significant functional
components, namely, a sentence lexer 122, a tool for
transforming text strings into ontological entities, and a
parser 124, a tool for analyzing syntactic relationships
between entities.

The architecture of the sentence lexer 122 is shown in
FIG. 4. A document iterator 210 receives documents or text
input 205, and outputs individual sentences to the lexer 122.
As the lexer 122 receives each sentence, it passes each
individual word to the ontology 128. If the word exists
within the ontology 128, it is returned as an ontological
entity; if not, it is returned as a word tagged with default
assumptions about its ontological status. In one
embodiment, words are automatically assumed to be nouns;
however, the words may be other parts of speech.
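
The lookup-with-default behavior of the lexer might be sketched as follows. This is a minimal illustration under stated assumptions: the `Lexeme` type, the `TOY_ONTOLOGY` dictionary and whitespace tokenization are hypothetical stand-ins, and the sketch returns a single sequence although the lexer described above may produce several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lexeme:
    text: str
    part_of_speech: str
    concept: Optional[str]   # None when the word is absent from the ontology


# Hypothetical toy ontology: word -> (part of speech, concept identifier).
TOY_ONTOLOGY = {
    "dog": ("noun", "canine"),
    "chases": ("verb", "pursue"),
}


def lex_sentence(sentence, ontology=TOY_ONTOLOGY):
    """Tag each word from the ontology; unknown words default to nouns."""
    lexemes = []
    for word in sentence.lower().split():
        if word in ontology:
            pos, concept = ontology[word]
            lexemes.append(Lexeme(word, pos, concept))
        else:
            # Default assumption from the described embodiment: treat as a noun.
            lexemes.append(Lexeme(word, "noun", None))
    return lexemes
```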

After the lexer 122 has checked the last word in a sentence
against the contents of the ontology 128, the sentence is
passed to a series of lexer filters 123. Filters 123 are modular
plug-ins, which modify sentences based on knowledge about
lexer word meanings. The preferred embodiment contains
several filters 123, although more may be developed, and
existing filters may be removed from future versions, without
altering the scope of the invention. The document
ontological parser 140 employs the following filters: proper
noun filter, adjective filter, adverb filter, modal verb filter,
and stop word filter. The query ontological parser 120 makes
use of all these filters, but adds a pseudo-concept filter.

The stop word filter removes stop words from sentences.
Stop words are words that serve only as placeholders in
English-language sentences. The stop word filter will contain
a set of words accepted as stop words; any lexeme
whose text is in that set is considered to be a stop word.

An adjective filter serves to remove lexemes representing
adjective concepts from sentences. The adjective filter checks
each adjective for a noun following the adjective. The noun
must follow either immediately after the adjective, or have
only adjective and conjunction words appearing between the
noun and the adjective. If no such noun is found, the
adjective filter will veto the sentence. The noun must also
meet the selectional restrictions required by the adjective; if
not, the adjective filter will veto the sentence. If a noun is
found and it satisfies the restrictions of the adjective, the
adjective filter will apply the selectional features of the
adjective to the noun by adding all of the adjective's
selectional features to the noun's set of selectional features.
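
A minimal sketch of the adjective filter rules is given below. The `Lex` type, the representation of selectional features and restrictions as Python sets, and the use of `None` to signal a veto are assumptions of this illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Lex:
    text: str
    pos: str                                      # "adjective", "noun", "conjunction", ...
    features: set = field(default_factory=set)    # selectional features of the lexeme
    requires: set = field(default_factory=set)    # restrictions an adjective places on its noun


def adjective_filter(sentence):
    """Return the sentence without adjectives, or None if the sentence is vetoed.

    Note: the features of matched nouns are updated in place.
    """
    for i, lex in enumerate(sentence):
        if lex.pos != "adjective":
            continue
        noun = None
        for following in sentence[i + 1:]:
            if following.pos == "noun":
                noun = following
                break
            if following.pos not in ("adjective", "conjunction"):
                break                  # something else intervenes: no valid noun
        if noun is None or not lex.requires <= noun.features:
            return None                # veto: no noun, or restrictions not met
        noun.features |= lex.features  # pass the adjective's features to the noun
    return [lex for lex in sentence if lex.pos != "adjective"]
```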

The proper noun filter groups proper nouns in a sentence
into single lexical nouns, rather than allowing them to pass
as multiple-word sequences, which may be unparsable. A
proper noun is any capitalized lexeme representing a noun
concept. If a word appears at the beginning of a sentence, it
is considered capitalized (and therefore a proper noun) if and
only if it was not present in the lexicon. Although a number
of proper nouns are already present in the lexicon, they are
already properly treated as regular lexical items. Since
proper nouns behave syntactically as regular nouns, there is
no need to distinguish proper nouns and nouns already in the
lexicon. The purpose of the proper noun filter is to ensure
that sequences not already in the lexicon are treated as single
words where appropriate.

The modal verb filter removes modal verbs from sentence
objects. Modal verbs are verbs such as "should", "could",
and "would". Such verbs alter the conditions under which a
sentence is true, but do not affect the meaning of the
sentence. Since truth conditions do not need to be addressed
by the ontological parser 120 or 140, such words can be
eliminated to reduce parsing complexity. The modal verb
filter will contain a set of modal verbs similar to the stop
word list contained in the stop word filter. Any lexeme whose
text is in that set and whose concept is a verb is identified as
a modal verb, and will be removed.

The adverb filter removes lexemes containing adverb
concepts from sentences. Adverbs detail the meaning of the
verbs they accompany, but do not change them. Since the
meaning of the sentence remains the same, adverbs can be
removed to simplify parsing.

The pseudo-concept filter operates only in the query
ontological parser 120. It removes verbs from queries, which
are not likely to be the actual predicate of the query.
Pseudo-predicate verbs include "give", "show", and "find".
Not all instances of these verbs are pseudo-predicates;
however, the first instance of them in a query often is. The
initial deterministic rule to be used in implementing the
pseudo-concept filter is that it should remove any instance of these
verbs not preceded by a content-bearing noun (i.e., one not
appearing in the list of pseudo-concepts or stop words).

The pseudo-concept filter operates only in the query
ontological parser 120. It removes concepts from queries,
which are not likely to be the actual concept the user intends.
Pseudo-concepts are largely nouns, and can be captured by
a stop word list. Pseudo-concepts include "I", "me", "you",
and in certain syntactic usages, "information", "news", and
related words. Two rules are included in the pseudo-concept
filter implementation. The first rule is that any word relating
to the user, or his current situation, such as "I" or "me" is
always deleted. The second rule is that any of the
"information"-type words is deleted when followed by a
preposition.

The architecture of the parser 124 is shown in FIG. 5.
First, the sentence receiver 310 obtains sentences consisting
of ontological entities produced by the sentence lexer 122.
These sentences are parsed by the parser 124, which is
designed to use a context-free grammar, although other
grammatical models may be used without departing from the
scope and spirit of the invention. Sentences are parsed into
structures called parse trees, which represent the relationships
between concepts in a sentence. Parse tree converter
315 receives the output of the parser 124, and converts the
parse trees into predicates. Following the parse tree
converter, parser filters 125 operate on the predicates to
remove erroneously generated predicates based on rules
about the probability of syntactic analyses, as well as rules
about the compatibility of concepts with each other.

The Naive Bayes is an established text classification
algorithm based on Bayesian machine learning technique.
The algorithm is based on a simple theorem of probability
known as Bayes' theorem or Bayes' formula.
Mathematically, the formula is represented as

\[ P(c_j \mid D) = \frac{P(D \mid c_j)\, P(c_j)}{P(D)} \]

where $c_j$ is a possible class in the set of all classes
$C$, $D$ is the document to be classified, and $P(c_j \mid D)$ is the
probability that document $D$ belongs to class $c_j$.

This method has the underlying assumption that the
probability of a word occurring in a document given the
document class $c_j$ is independent of the probability of all
other words occurring in that document given the same
document class:

\[ P(w_1, \ldots, w_n \mid c_j) = \prod_{i=1}^{n} P(w_i \mid c_j) \]

where $w_1, \ldots, w_n \in D$. Thus, this classifier picks the best
hypothesis $c^*$ given by:

\[ c^* = \arg\max_{c_j \in C} P(c_j \mid w_1, \ldots, w_n) = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(w_i \mid c_j) \]

The Bayes classifier 130 is a tool for answering questions
of rigid form based on a collection of example answers. It is
capable of learning from a set of example questions and their
answers. These examples are referred to as the Example Set.
After learning from them, the classifier is said to be trained.
A trained classifier is capable of answering questions whose
form is similar to the forms of the questions in the example
set. The answers that a classifier can give or learn from are
called values. A Bayes classifier 130 can only answer a
question with a value that was present in the Example Set.
Questions consist of a set of attributes, each of which is a
word, token, or term. A classifier can only consider attributes
that were present in its Example Set when answering a
question.

The construction and use of a Bayes classifier 130 in the
system can be decomposed into three functional stages:

Example Collection Stage: collect all example data from a set of
example documents.
Classifier Training Stage: Learner trains classifier using given example
data.
Classifier Reasoning Stage: Reasoner 134 answers question based on its
trained knowledge.

FIG. 6 illustrates the Bayes classifier 130 example collection
stage for a domain-specific classifier. The system
provides a collection of example documents, which contains
both documents relevant to the specific domain and documents
irrelevant to the specific domain. Topic editors manually
classify each example document, and assign a value
"YES" or "NO" to it to indicate whether the document is a positive
or negative example. After the topic editors'
classification, as shown in FIG. 6, the attribute extractor 400
will collect all attributes from the example documents
390. An example set 410 is then generated based on the
attributes and value of each example document. Example
set 410 will be stored in the data repository for later use.

Attribute extractor 400 includes processes to eliminate the
stop words in the input documents and to eliminate the word case
sensitivity problem. In general, all words in a document are
converted to lower case before performing attribute collection.
Stop words are those words that are judged to be so
common in an information system that they have little
information value, such as "a", "the", "of", "is",
"could", and "would". The elimination of stop words will reduce
the example set 410 size and save data repository space.
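
The two normalization steps performed by the attribute extractor (lower-casing and stop word removal) might be sketched as follows. The function name `extract_attributes` and the regular-expression tokenizer are assumptions of this illustration; the stop word set contains only the examples quoted in the text.

```python
import re

# Stop words quoted in the text; a deployed list would presumably be longer.
STOP_WORDS = frozenset({"a", "the", "of", "is", "could", "would"})


def extract_attributes(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # assumed tokenization rule
    return [token for token in tokens if token not in stop_words]
```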

For a query-topic-specific classifier the system will predefine
a set of topics that the query will fall in, such as "art",
"business", "finance", "health", "science". For each topic,
topic editors input a set of sample queries. The attribute
extractor 400 will collect all attributes from the sample
queries, and generate example set 410. The value assigned to
each query example is the topic's name.

Besides including a stop word filtering process, attribute
extractor 400, for the query-topic-specific classifier, includes
a process to expand the attributes based on their meaning in the
sample queries. For example, for a query like "What is the
current situation of the stock market?", attribute extractor
400 extracts the direct attributes "current", "situation", "stock",
and "market" from this query. It can also expand the attribute
"stock" to "finance", "banks", "brokerages", "Wall Street,"
etc. by implementing a concept ontology that contains
hierarchical concepts. This process will improve the speed
of generating an example set 410 and save topic editors'
editing time.
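
The attribute expansion step might be sketched as shown below, using the "stock" example from the text. The dictionary `CONCEPT_ONTOLOGY` is a hypothetical, flattened stand-in for the hierarchical concept ontology, and `expand_attributes` is an illustrative name only.

```python
# Hypothetical flattened view of a concept ontology: attribute -> related concepts.
CONCEPT_ONTOLOGY = {
    "stock": ["finance", "banks", "brokerages", "wall street"],
}


def expand_attributes(attributes, ontology=CONCEPT_ONTOLOGY):
    """Append related concepts to the direct attributes, without duplicates."""
    expanded = list(attributes)
    seen = set(attributes)
    for attribute in attributes:
        for related in ontology.get(attribute, []):
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded
```

The direct attributes are kept first and in their original order, so that the expansion only ever adds to what the extractor found in the sample query.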

A classifier training stage using the collected example 
data to train the classifier is shown in FIG. 7. The algorithm 
for training the Bayes Classifier is shown in FIG. 8. 

The learner component 132 sends a request to data
repository 150 to get the generated example set 410. Then the
learner 132 uses the data stored in example set 410 to train
the classifier 130. The training process of the Bayes
classifier 130 is the process of calculating the class value
probability $P(c_j)$ and the conditional attribute probability $P(w_i \mid c_j)$.
The training process will populate the training set 440 with
attribute, class value and calculated probability values. A
populated training set 440 represents the knowledge of a
Bayes classifier 130.
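
A minimal training sketch is given below. It assumes the add-one smoothed estimates that FIG. 8 appears to describe, namely P(c_j) = |docs_j| / |Examples| and P(w_k | c_j) = (n_k + 1) / (n + |A|), where n is the number of attribute occurrences in class c_j and |A| is the number of distinct attributes. The function name `train` and the returned dictionaries are illustrative choices, not part of the specification.

```python
from collections import Counter, defaultdict


def train(example_set):
    """example_set: list of (attributes, value) pairs, attributes being a word list.

    Returns (priors, conditionals), a stand-in for the populated training set.
    """
    vocabulary = {w for attributes, _ in example_set for w in attributes}
    docs_per_value = Counter(value for _, value in example_set)
    counts = defaultdict(Counter)            # value -> attribute occurrence counts
    for attributes, value in example_set:
        counts[value].update(attributes)

    priors, conditionals = {}, {}
    for value, n_docs in docs_per_value.items():
        priors[value] = n_docs / len(example_set)
        n = sum(counts[value].values())      # attribute occurrences in this class
        conditionals[value] = {
            w: (counts[value][w] + 1) / (n + len(vocabulary))   # assumed smoothing
            for w in vocabulary
        }
    return priors, conditionals
```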

In a classifier reasoning stage, a trained Bayes classifier 
130 performs question-answering functions based on the 
knowledge learned from classifier training process. 

In general, the system will accept an input text. Attribute
extractor 400 collects all attributes in this input document
and sends them to the reasoner 134. After the reasoner 134
accepts the input attributes, it sends a request to the data
repository 150 to access the corresponding training set 440.
According to Bayes' formula, the reasoner 134 will
calculate the answer probability $P(c_j \mid w_1, w_2, \ldots, w_n)$ for each
possible class value $c_j$ and pick the class value $c_j$ with the
maximum answer probability as the final answer.

The algorithm for calculating answer probability $P(c_j \mid w_1,
w_2, \ldots, w_n)$ is described as follows and is shown in FIG. 8.
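
The reasoning step might be sketched as follows. The function name `classify` and the dictionary layout of the training set are assumptions. Log probabilities are summed instead of multiplying raw probabilities, which is an implementation choice of this sketch to avoid numeric underflow and does not change which class value attains the maximum.

```python
import math


def classify(attributes, priors, conditionals):
    """Rank class values by log P(c_j) + sum of log P(w_i | c_j), best first.

    priors: value -> P(c_j); conditionals: value -> {attribute: P(w_i | c_j)}.
    """
    scores = {}
    for value, prior in priors.items():
        log_score = math.log(prior)
        for w in attributes:
            p = conditionals[value].get(w)
            if p is not None:      # attributes unseen during training are ignored
                log_score += math.log(p)
        scores[value] = log_score
    return sorted(scores, key=scores.get, reverse=True)
```

The first element of the returned list corresponds to the final answer; the full list could serve where several ranked candidates are wanted.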

Two different kinds of Bayes classifiers are defined in this
system. One is the document-domain-specific classifier, which is
used to determine whether an input document 391 belongs to a
specific domain or not. FIG. 9 shows how a
document-domain-specific classifier reasons. The system will accept
the input document, then attribute extractor 400 collects all
attributes occurring in the input document and sends them to the
document-domain-specific classifier's reasoner 134. The
reasoner 134 performs classification based on the knowledge
it learned and gives back the answer "Yes" or "No" to indicate
whether the input document belongs to the specific
domain.

The other kind of Bayes classifier 130 is the
query-topic-specific classifier, which is used to identify
the input query's topic. FIG. 10 describes how a
query-topic-specific classifier determines the input query's
topic. The input query is sent to attribute extractor 400,
which collects all attributes in the query. Reasoner 134
takes the collected attributes as input, classifies the
query's topic based on the knowledge stored in the training
set 440, and generates a list 450 of candidate topics ranked
in order of probability of correctness. The system can choose
only the first topic, or the first few topics, as the input
query's topic(s).
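
The ranked-list behavior can be sketched as follows. The function name, the dictionary input, and the probability values in the usage note are illustrative assumptions; the specification only states that candidate topics are ranked and the first one or few are kept.

```python
def top_topics(topic_probabilities, n=1):
    """Return the n most probable topics, most probable first.

    topic_probabilities: {topic: probability}; an assumed layout.
    """
    ranked = sorted(
        topic_probabilities.items(), key=lambda item: item[1], reverse=True
    )
    return [topic for topic, _ in ranked[:n]]
```

For instance, with made-up probabilities of 0.6 for finance, 0.3 for business and 0.1 for science, n = 1 keeps only finance and n = 2 keeps finance and business.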
Self-organizing feature maps are well-understood and
computationally tractable systems for providing statistical
and probabilistic modeling of large, complex sets of data
with minimal computational overhead. In contrast to other
probability distribution mapping techniques, the
self-organizing feature map processes only a single input
pattern at a time, with no need to retain large data sets in
memory for global processing across the entire collection of
data.

In the document clustering component 180 the input
pattern, in this case a pattern of ontological concepts
generated by the ontological parsers 120 and 140, is
formulated in the form of a vector. This vector is fed into a
simple, two-layer self-organizing feature map, with small,
randomly assigned adaptive weights allocated to each node in
the feature map, and to each input connection from the input
vectors to each node. For each such node in the map, the dot
product (a measure of relative closeness) is calculated
between the input vector V and the weight vector W of the
weights on the input connections for that node. The node
with the largest dot product is the node with a weight vector
that most closely aligns in concept pattern space with the
input vector. This node, plus those nodes in its immediate
physical vicinity within the feature map, adjust their weight
vectors W via a simple learning algorithm:

In this algorithm, k is the learning parameter, which varies
slowly throughout the training process. Because this training
process is computationally extremely simple, it can be
performed in real-time environments with little operational
time penalty. In addition, the size of the vicinity of the
feature map being adjusted with each input pattern is
similarly adjusted, typically reduced, until only a single
node's weight vector is adjusted with each new pattern.
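
One training step of such a map can be sketched as below. The update formula is not reproduced in this text, so the sketch assumes the standard Kohonen rule W <- W + k (V - W). It also assumes, purely for brevity, a one-dimensional arrangement of nodes; the specification does not state the map topology. All names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_step(weights, v, k, radius):
    """Apply one input pattern v to the map and return the winner index.

    weights: list of weight vectors, one per node (modified in place).
    k: learning parameter; radius: neighborhood size in nodes.
    Assumes a 1-D map and the rule W <- W + k * (V - W).
    """
    # The winning node has the largest dot product with the input.
    winner = max(range(len(weights)), key=lambda i: dot(weights[i], v))
    lo = max(0, winner - radius)
    hi = min(len(weights) - 1, winner + radius)
    # The winner and its neighbors move toward the input vector.
    for i in range(lo, hi + 1):
        weights[i] = [w + k * (x - w) for w, x in zip(weights[i], v)]
    return winner
```

A caller would lower k and shrink radius gradually over successive patterns, matching the description of a slowly varying learning parameter and a shrinking vicinity.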

The effect of this is that after a substantial number of input
patterns are processed (or after a smaller number of input
patterns are repeatedly processed), the distribution of the
weight vectors in concept pattern space is an accurate model
of the probability distribution function of the input patterns
of the vectors used during this training. Because the training
processing is so simple, it is also practical to leave
training on constantly, thus allowing the feature map to
continually adapt to changing distributions of patterns.

The document clustering component 180 makes use of the
results of the document ontological parser 140. Each
document stored in the data repository 150, and each document
matched to a user query, must first be processed by the
ontological parser 140 to determine that it is such a match.
As a result, a collection of predicates is associated with
that document. The pattern those predicates make within a
document constitutes the concept pattern of that document.

The concept pattern of a document is more revealing than
any single predicate within the document when trying to
determine the contents of a document. Thus, the document
clustering component 180 can be used to provide a more
in-depth, query-specific search when the basic ontological
parsing cannot winnow the search space sufficiently.

As the concept based search and retrieval system 100
begins to produce potential matches for a specific user query,
the document clustering component 180 begins to train itself
on those matches. Each document is represented by one or
more concept pattern vectors, and, as a document is added
to the list of possible matches, those concept pattern vectors
are fed to a self-adaptive feature map constructed
specifically for this query. The nascent feature map
self-adapts to these concept pattern vectors as they appear,
so that by the time the final document match is located, the
feature map is mostly, or even fully, trained. At this stage,
the feature map represents clusters of documents, which are
relatively more or less similar to each other.

When it is determined that the set of located matches
exceeds a user-specified "too many matches" parameter, the
document clustering component 180 selects one example
document from one of the larger clusters of documents
within the feature map. (This represents the most common
cluster of concept patterns, and therefore represents a
reasonable guess at the documents most likely to match the
user's original query.) This document is presented to the user
along with a choice of "more documents like this" or
"documents different than this."

If the user requests similar documents, the document
clustering component 180 presents those documents
clustered closest to the example document within the feature
map, up to the user-specified maximum number. These
documents will be the closest available matches to the user's
query within the searched space.

If the user requests more documents different than the
example, the document clustering component 180 presents a
new example or examples from a cluster or clusters as far
from the original cluster as possible. This, in essence, bisects
the multidimensional concept pattern space. The user can
again determine if the new example is similar to or different
than the desired documents, and the partitioning of concept
pattern space continues.

This repeated bisecting of concept pattern space
efficiently homes in on the exact meaning the user intended
and, with minimal effort, provides users with the precise
results from the document search.

When the document clustering component 180 is not
actively processing user queries, it can also be used to
pre-cluster documents stored in the data repository 150 as
"known documents." Because the clustering is fully
automatic, it negates any need for human categorization of
these documents, and thus allows high-speed, efficient
clustering of documents based not only on individual
predicates but also on predicate context and overall concept
patterns within the documents as a whole.

The search and retrieval system must index documents
and serve them to end-users. The five operating modes of the
System are:

Exact match (end user mode, online mode),
Document predicate match mode (offline mode),
Bayes training mode (offline mode),
Document indexing mode (online mode), and
Maintenance mode (offline mode).

The operating modes represent the various ways in which
the previously discussed components can interact.

In Bayes training mode, shown in FIG. 7, the object is to
train the Bayes classifier 130 to learn membership criteria
for a specific topic. This is the mode in which new domain
classifications are acquired through the statistical learning
techniques discussed previously. In this mode, topic editors
provide training data in the form of specific documents
known to belong to the selected topic or domain. These
documents are used to train the Bayes classifier 130. The
classifier is able to determine membership criteria
dynamically from the documents in the training set, after
which the classifier is considered trained. Once a trained
classifier is produced, it can be used by a Bayes classifier
130 to determine whether a document belongs to the domain in
which it was trained. This type of classifier is very fast and
very accurate in determining domain membership on a very
large set of documents. The results of this training process
are stored in the data repository 150 for use by the Bayes
classifier 130.

In the document-indexing mode a search index is built to
store documents as predicates, which can later be used to
efficiently match user queries to indexed documents. This
mode exercises all of the components provided by the
concept-based search and retrieval system 100, and
demonstrates all of the benefits of the present invention.
However, due to the current state of technology (namely
processing power, storage devices and access times, and
Internet access speeds) other modes are used to allow users to
gain real-time results from their searches.

Documents are indexed by the predicates they contain,
which are equivalent to user queries. Thus, two methods are
provided for indexing new documents, both of which are
realized in the component. The first method involves the
retrieval of new documents to answer user queries which do
not already have documents matched to them. The questions
are parsed by the query ontological parser 120, which
produces a set of predicate structures. These predicates
contain a plurality of keywords, which may be brokered to
other search facilities to retrieve their indexed documents
relating to the user's query.

Alternatively, a spider may be used to retrieve documents
automatically from the web, without prior knowledge of
their content. These documents are not retrieved via any
brokering of user queries, only through standard spidering
techniques.

Regardless of the method by which documents are
acquired, they are then passed into the Bayes classifier 130,
which classifies the documents according to the set of
available trained classifiers previously learned in the Bayes
classifier 130 Training Mode.

Each of these documents is sent into the document
ontological parser 140, which parses it into a predicate
library. Each predicate library is compared to the original
query predicate structure(s) for relevancy and is assigned a
score based on the relevancy. A special scoring algorithm
used specifically for the concept based search and retrieval
system 100 accomplishes this. The results are then passed to
the data repository 150 for later retrieval and/or comparisons
with other queries. When no query exists to compare the
document against, every predicate within the document is
treated as a predicate library to be scored.

The document-indexing mode is fully automated. Once
queries are made available to the concept based search and
retrieval system 100 or a spider is brought online, the
indexing operation requires no further human action to
occur. This indexing mode is more efficient than manual
methods used by current search directory companies
because topic editors do not have to manually categorize
web documents.

In this mode, query ontological parser 120 parses the
user's query and produces a plurality of predicates,
dependent upon the length of the user's query and the number
of entities it contains. The query ontological parser 120 does
not perform any generalizations of the user's query
predicate(s) in this mode.

Next a search is performed on the data repository 150 for
any stored documents that contain any of the predicates
produced by query ontological parser 120. In the event of a
match, the documents matching the query are sent to a
ranking and formatting component (not discussed) to produce
the search results in a user-friendly format, which is then
returned to the user.

An exact match occurs when a query posed by the end
user produces an exact match, meaning that the predicate
structures found in the query are also present in at least one
document in the data repository 150.

The document predicate match mode is nearly identical to
exact match mode, with only one difference. In this mode the
query ontological parser 120 is allowed to create generalized
predicates, based on the user's original query. This entails
the substitution of synonymous words for all concepts in the
user's query. This allows for a broader document search to
be performed, as only relationships have to be matched, not
specific words. This mode generates multiple queries as a
result of a single user query, and thus requires more
processing power and time to perform as efficiently as exact
match mode. For this reason, the concept based search and
retrieval system 100 only switches to document predicate
mode when no exact matches are found. However, the
concept based search and retrieval system 100 may employ
document predicate mode instead of exact match mode as
advances in hardware relieve these performance concerns.

Once generalized predicates have been created by the
query ontological parser 120, a search for document
predicates that match these generalized query predicates is
performed. After all matches have been gathered, the links to
documents pass through the same ranking and formatting
components used in exact match mode.

The maintenance mode becomes active when queries do
not result in enough document matches to meet
requirements, or when no matches at all are returned. When
that happens, the end user may optionally leave an e-mail
address with the system. When new documents that match
the query are located the end user is notified via e-mail.
This mode permits retraining the Bayes classifier 130 with
better documents, or processing more documents with the
query ontological parser 120 to produce a better data
repository 150.

The following is an example of a sentence and
demonstrates both how it is parsed as a sentence within a
document (for storage within the data repository 150), and how
a question would produce matching predicates to retrieve the
document containing this sentence.

The example sentence is:

The octopus has a heart.

First, the sentence lexer 122 would process this sentence.
The first component of the sentence lexer 122, the document
iterator 210, would extract this sentence from the document
it was contained in. At this stage, it would exist as the text
string shown above. Following that, it would be passed to
the lexer 122, which would access the ontology 128, and
return the sequence:

The-det octopus-noun have-verb a-det heart-noun.

Here, "det" stands for determiner, which is a word with a
purely grammatical function, namely specifying a noun
phrase. The other tags, noun and verb, indicate parts of
speech with ontological content. Thus, when the sentence
passes through the lexer filters 123, the StopWordFilter
removes "a" and "the", leaving:

octopus-noun have-verb heart-noun

The sentence is then taken up by the sentence receiver
310, which passes it to the parser 124. In the parser 124, the
tree shown in FIG. 11 is produced.

A parse tree converter 450 then converts this tree into a
predicate, where octopus is the subject of have, and heart is
the object. The predicate is:

have<octopus, heart>

This predicate is then passed through the parser filters
125, where it successfully passes the parse probability and
selectional feature compatibility tests. After that, it is
stored in a predicate library, and passed to the data
repository 150.

Suppose that a user asks the question, "Do octopuses have
hearts?"

The question will be read by the sentence lexer 122, and
a sentence made of ontological entities is produced. It reads:

Do-verb octopus-noun have-verb heart-noun

In the lexer filters 123, the PseudoPredicateFilter removes
the first verb, "do", because it is not the main verb of the
sentence. "Do" only serves to fill a grammatical role within
this type of question, and is thus removed, leaving:

octopus-noun have-verb heart-noun

This is identical to the sentence produced above, and
results in the same parse tree and the same predicate
structure. Thus, when the query ontological parser 120
receives this question, it will enable the data repository 150
to find the document containing the sentence originally
discussed.

Suppose the system is to use the Bayes classifier 130 to
create a Finance-domain-specific classifier that can identify
whether or not a given document belongs to the Finance
domain. The construction of the Finance-domain-specific
classifier includes two stages: an example collection stage
and a classifier training stage.

FIG. 12 is a flow diagram illustrating the first stage of
constructing a Finance-domain-specific classifier.

Topic editors collect a set of example documents that
contains both documents relevant to and documents irrelevant
to the Finance domain. This operation is shown in step 500 of
FIG. 12. As shown in step 502, the topic editor will classify
each example document. If an example document belongs to the
Finance domain, the topic editor assigns the value "Yes" to
it; if an example document is irrelevant to the Finance
domain, the topic editor assigns the value "No" to it. Step
510 includes processes to parse each example document,
eliminate the case sensitivity problem, filter stop words and
collect all other words in the document. All collected words
from an example document are treated as a set of attributes
representing that example document. All attributes are in
lower case.

A Finance Example set 410 is generated and stored into a
data repository in step 515. A Finance Example set 410 is
essentially a set of examples, each of which consists of a set
of attributes that represent the document and the value "Yes"
or "No" which is assigned by the topic editor during step 502.

TABLE 2

Document  Attributes                                        Value

1         financial, shares, led, strong, rally, stock,     Yes
          market, Friday, investors, cheered, agreement,
          Lawmakers, clinton, administration, remove, key,
          obstacle, legislation, overhaul, financial,
          services, industry

2         canada, 21, sorted, agent, sparking, next,        No
          message, biological, mbox, thread, messages,
          control, thistle, [...], reply, previous

Table 2 is a sample Finance Example Set which contains
two documents, one belonging to the Finance domain and the
other irrelevant to the Finance domain. After generating a
Finance Example Set, the classifier's construction goes to
the next stage, the classifier training stage.

FIG. 13 is a flow diagram showing a Finance-domain-
specific classifier's training process. As shown in step 600,
the Finance Example Set, which contains the attributes and
values of example documents, is retrieved from the data
repository.

Step 602 includes a process to collect distinct attributes and
values from the Finance Example Set. Define set variables A
and C; set

A <- all distinct attributes in Finance Example set 410
C <- {Yes, No}, the distinct values in Finance Example set
410

Step 605 includes a process to calculate value probability
P("Yes") and value probability P("No"). P("Yes") represents
the percentage of documents relevant to the Finance domain
that exist in the Finance Example Set, and P("No") represents
the percentage of documents irrelevant to the Finance domain
that exist in the Finance Example Set. In step 607 the
conditional attribute probabilities P(w_i | "Yes") and
P(w_i | "No") for each attribute w_i are calculated.

As shown in block 620, a Finance Training Set is generated
and stored into a data repository 150. The Finance Training
Set contains the value probabilities P("Yes") and P("No"),
and a pair of conditional attribute probabilities
P(w_i | "Yes") and P(w_i | "No") for each attribute w_i in A.

A populated Finance Training Set represents the knowl- 
edge of the Finance-domain-specific classifier. 

Based on the probability data stored in the Training Set, the
Finance-domain-specific classifier can determine whether an
input document belongs to the Finance domain.

The construction of a Finance-domain-specific classifier
is now finished. Next, how the Finance-domain-specific
classifier is used to classify documents is discussed.

FIG. 14 is a flow diagram illustrating the Finance-domain-
specific classifier's classification process. As shown in
FIG. 14, the concept based search and retrieval system
accepts an input document for classification in step 700.
Suppose the input document is a story from the CNN news
on Wednesday, Mar. 1, 2000:

Technology stocks soared in heavy trading Wednesday, 
pushing the Nasdaq composite index to its 13th record 
of the year as money poured into the economy's hottest 
chipmakers, telecommunications companies and bio- 
technology firms. 

The attribute extractor 400 collects attributes from this
input document in step 705. Step 705 further includes
processes to convert all words into lower case in order to
eliminate the case sensitivity problem, to filter stop words
and to collect all other words. Table 3 contains all
attributes collected from the input document after those
processes.

TABLE 3

Technology, stocks, soared, heavy, trading,
Wednesday, pushing, nasdaq, composite, index,
13th, record, year, money, poured, economy,
hottest, chipmakers, telecommunications,
companies, biotechnology, firms
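
The attribute extraction of step 705 can be sketched as follows. The stop-word list is a small illustrative assumption, as are the tokenization rule and the handling of the possessive "'s"; the specification does not say which stop words or tokenizer the attribute extractor 400 uses.

```python
import re

# Illustrative stop words only; the real list is not given in the text.
STOP_WORDS = {"a", "an", "and", "as", "in", "into", "its", "of", "the", "to"}


def extract_attributes(text, stop_words=STOP_WORDS):
    """Lower-case the text, drop stop words, return remaining words."""
    lowered = text.lower()
    # Assumed handling: strip possessive endings such as "economy's".
    lowered = re.sub(r"'s\b", "", lowered)
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [t for t in tokens if t not in stop_words]
```

Repeated words are kept rather than deduplicated, since the number of times an attribute occurs matters to the probability calculation.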



The Finance Training Set, which represents the knowledge
of this classifier, is retrieved from the data repository 150 in
step 710. In step 720, the process includes the computation
of the answer probabilities P("Yes"|Doc) and P("No"|Doc)
according to the following equations:

$$P(\text{"Yes"} \mid D) = P(\text{"Yes"} \mid w_1, w_2, \ldots, w_n) = P(\text{"Yes"}) \prod_{i=1}^{n} P(w_i \mid \text{"Yes"})$$

$$P(\text{"No"} \mid D) = P(\text{"No"} \mid w_1, w_2, \ldots, w_n) = P(\text{"No"}) \prod_{i=1}^{n} P(w_i \mid \text{"No"})$$



Here w_1, w_2, ..., w_n are the attributes that represent the
input document, namely the attributes listed in Table 3. The
value probabilities P("Yes") and P("No"), and the conditional
attribute probabilities P(w_i | "Yes") and P(w_i | "No"), can
be retrieved from the Finance Training Set.

Finally the classifier compares P("Yes"|D) and P("No"|D)
in step 725.

If P("Yes"|D) > P("No"|D) then it returns TRUE in step
730 to indicate the input document belongs to the Finance
domain; otherwise it returns FALSE, in step 735, to indicate
the input document is irrelevant to the Finance domain.

[...] based search and retrieval system 100 to identify the
input query's topic. In general, the system will predefine a
set of topics called the Topic Set. Suppose the system has
defined a Topic Set containing finance, science, and business
as three topics. The system uses the Bayes classifier 130 to
construct a query-topic-specific classifier that can identify
whether an input query belongs to one of these topics.

Procedures for constructing a query-topic-specific
classifier are the same as procedures for constructing a
document-domain-specific classifier.

FIG. 15 is a flow diagram showing how to generate a Query
Topic Example set 410. Topic editors create a set of example
queries in step 800, such as:

TABLE 4

"What happened to stock market?"
"Where can I find a list of venture capitalists?"
"Where can I find a concise encyclopedia article on orbits?"
"Where can I learn about the math topic geometry?"
"Where can I find a directory of Web sites for hotels in Nevada?"



The topic editor assigns a value to each query based on his
or her own knowledge in step 805. This also includes a
process to extract concepts from the entered user query and
expand those concepts by looking up a concept ontology that
contains hierarchical concepts in step 810. For example, a
query like "What happened to stock market?" can be
converted to a predicate structure containing the concepts
"happened" and "stock market." With a concept ontology 128
that contains hierarchical concepts, "happened" may expand
to include "took place" and "occurred," while "stock market"
may expand to include "stock exchange," "securities
market," and "Nasdaq."

A Query Topic Example set 410 is generated and stored
into a data repository 150 for later use in step 820. Table 5
is a sample Query Topic Example set 410 generated for the
sample queries listed in Table 4.



TABLE 5

Query  Attributes                                         Value

1      "happened", "took place", "occurred", "happen",    "finance"
       "take place", "occur", "stock Market",
       "stock exchange", "securities market", "Nasdaq",
       "New York Stock Exchange", "stock",

2      "venture capitalists", "speculator", "plunger"     "finance"

3      "concise", "encyclopedia", "article", "orbits",    "science"
       "brief", "cyclopedia", "reference book"

4      "learn", "math", "topic", "geometry",              "science"
       "mathematics"

5      "directory", "Web sites", "hotel", "Nevada",       "business"
       "motel", "web page". . .





60 Training processes for the query topic specific classifier 
are the same as training processes for document-do ma in- 
specific classifier described earlier. FIG. 16 is a flow dia- 
gram shows how to train a query topic specific classifier. A 
query topic example set is stored in the data repository 150 

65 in step 850. 

Next, distinct attributes and values are collected in step 
855. The value probability is calculated for each distinct 
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value in the query topic example set, in step 860. In step 865, might also be classified as a real estate document if the story 

for each distinct attribute in the query topic example set, the was about how interest rate changes would affect mortgages 

attribute value conditional probability is calculated. The and new home buying. 

query topic training set is then generated and stored in the In this example, a range of available classifiers has not 
data repository 150 in step 870. 5 been defined. However, it is still worthwhile to explain why 
FIG, 17 is a flow diagram showing how a trained query this sentence would be likely to be classified as financial, 
topic specific classifier identifies the topic of an input query. Each word is independently considered as an attribute of the 
The whole process is very similar to Lhc operation of the document. Thus, the facts that the word "financial" appears 
document domain specific classifier. The trained query twice, the word "stock" appears once, the word "shares" 
topic-specific classifier accepts an input query in step 900. 10 appears once, and the word "investors" appears once mean 
Next the attribute extractor 400 collects attributes from the that there are five financial attributes for the document from 
input query in step 905. In step 910, the classifier accesses lhis sentence alone. When this is compared with the financial 
the query topic training set stored in the data repository 150. training set it is clear that this document contains many 
The answer probability is calculated for each possible matches with the financial category, 
answer value, in step 920. „ A M{ ™ being classified, the sentence then goes to the 
Tlie query topic specific classifier returns a list of topic 15 ontological parser 140 where it is transformed as 
... . r . .-I-. c . . j c discussed previously. Because of the length of the sentence, 
ranked in order oi probability or correctness instead or a ... r , c J , • j * •» t n *u 
i (lf m,fn. ur^AT^r» . <v»*? ^ and the number of rules required to parse it successfully, the 
siogle TRUE or FALSE answer m step 925. For (ree lha( WQuld b{ . ^nocated ftom the example sen- 
example, for an input query like What is the current stock lence b not shown Ho wever, the resulting predicates would 
price of IBM? the Query Topic Specific classifier may 2 q include- 

generate a topic probability table as follows: lead<fina D cial shares, strong rally* 

in<strong rally, stock market> 
in<stock market, Friday> 
cheer<investors, agreement, lawmakers> 
cheer<investors, agreement, Clinton administration> 
remove<agreement, key obstaclo 
to<key obstacle, legislation> 
overhauUlegislation, financial-services industry> 
The system can predefine a number n to indicate how 30 Not all of these words or phrases would be included in the 
many values will be returned, llien the first n topics will be ontology 128, and so the sentence lexer 122 would not be 
returned as input query's topic(s). For the above example, if able to place them all in ontological context. However, the 
the predefined value for n is one, only the first topic, finance, linguistic rules encoded in both the lexing and parsing stages 
is returned as the input query's topic. If n is two, then the still enable The concept based search and retrieval system 
system returns the first two topics, finance and business as 35 100 to recognize noun phrases such as "Clinton administra- 
te input query's topics. tion" and "financial-services industry". These phrases can 
During document indexing mode each component's func- only be retrieved through exact matches of text, since they 
tion is examined by tracing the path of a complex, but are not linked to any synonyms in the ontology 128. 
typical, sentence from the Wall Street Journal as it passes However, the predicates they occur within can still be found 
through the concept based search and retrieval system 100.
Specifically, the first sentence from the lead story of the
Money & Investing section of the Wall Street Journal
Interactive Edition on Friday, Oct. 22, 1999 is shown as
follows:

"Financial shares led a strong rally in the stock market
Friday as investors cheered an agreement by lawmakers
and the Clinton administration that removes a key
obstacle to legislation that would overhaul the
financial-services industry."

The first stage of the search process is for the pages that
potentially match the user's query to be picked up by the
search collector, a component of the search and retrieval
system whose sole responsibility is the acquisition of
documents. At this stage, the example sentence being tracked
exists as a few lines of HTML text within a document
downloaded by the concept based search and retrieval
system 100.

The next stage of the concept-based search and retrieval
system is the Bayes classifier 130. The Bayesian classifier
130 takes in the text of the document, and converts it into a
set of attributes which are compared against each of the
trained classifiers within the Bayes classifier 130. A
document may be classified as belonging to multiple categories
if the words within the document correspond to attributes of
several classifiers. For example, a financial page would be
classified as such by the occurrence of words like stock,
bond, Federal Reserve, and interest rate. The same page

through the full power of the concept based search and
retrieval system 100.

Finally, the document is stored in the data repository 150.
Multiple references to the page are generated, each one
consisting of a predicate from the list above and a link to the
document. The document is scored for its relevance to each
of the above predicates, and that information is also
maintained, to assist in ranking and formatting during
retrieval.

Retrieval is initiated by a user query submitted through
the user interface 110. Queries may come from any
HTTP-compliant source, generally a web page. The user's query is
submitted in the form of a natural-language question. For
example, the user might ask:

"What happened in the stock market Friday?"

The query is received by the query ontological parser 120,
which converts it into one or more predicates depending on
the search mode invoked. In exact match mode, the query is
converted into two predicates, happen<?, stock market> and
in<stock market, Friday>. These predicates are then used to
search the data repository 150. Since the second predicate
matches one of the ones listed above, the document is
returned to the user.

However, the document is also relevant to questions
which do not contain exact matches of any of the predicates
above, such as:

"How did investors react to the agreement on financial
legislation?"
This would parse into the query predicates
react<investors, agreement> and on<agreement, financial
legislation>. Neither of these matches the predicates from the
sentence shown above, and so exact match mode would not
return the document. However, if exact match fails to turn up
appropriate results, or if document predicate mode is set as
the default, the query ontological parser 120 will generate
additional predicates, using both synonyms and higher-level
concepts. One of the query predicates generated by the query
ontological parser 120 in this mode would be
judge<investors, agreement>, as "judge" is a parent concept
of "react". Since it is also a parent concept of "cheer", this
query would return the document containing the example
sentence.
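
The expansion from "react" to its parent concept "judge" can be illustrated with a minimal Python sketch. Only the judge/react/cheer relation comes from the description above; the `PARENT` table, the `Predicate` class, and the rule that a query matches when its arguments are a leading subset of the document's arguments are assumptions made for illustration, not the parser's actual data structures.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: each concept maps to its parent concept.
PARENT = {"react": "judge", "cheer": "judge"}

@dataclass(frozen=True)
class Predicate:
    name: str
    args: tuple

def expand(pred):
    """Return the predicate plus a copy generalized to its parent concept, if any."""
    forms = [pred]
    parent = PARENT.get(pred.name)
    if parent is not None:
        forms.append(Predicate(parent, pred.args))
    return forms

def exact_match(query, doc):
    """Exact match mode: name and arguments must be identical."""
    return query == doc

def concept_match(query, doc):
    """Assumed rule: some expanded forms share a name and the query's
    arguments equal the leading arguments of the document predicate."""
    for q in expand(query):
        for d in expand(doc):
            if q.name == d.name and d.args[:len(q.args)] == q.args:
                return True
    return False

doc = Predicate("cheer", ("investors", "agreement", "lawmakers"))
query = Predicate("react", ("investors", "agreement"))
print(exact_match(query, doc))    # False
print(concept_match(query, doc))  # True
```

Both predicates generalize to "judge", so the expanded forms meet even though the original verbs differ.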

Note that the number of arguments in the predicates does
not match; the document parses to cheer<investors,
agreement, lawmakers>, which has more arguments than
judge<investors, agreement>. The mismatch in the number
of arguments would result in a lower score in the ranking and
formatting component; however, if not many documents
match the query, it will still be prominently displayed as a
match for the user's question.
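
The effect of an argument-count mismatch on ranking can be sketched as follows. The description does not give the scoring formula, so the ratio used here (shared arguments divided by the larger argument count) and the names `arity_score` and `rank` are purely hypothetical stand-ins.

```python
def arity_score(query_args, doc_args):
    """Score in [0, 1]: shared arguments divided by the larger argument count."""
    shared = len(set(query_args) & set(doc_args))
    total = max(len(query_args), len(doc_args))
    return shared / total if total else 0.0

def rank(query_args, candidates):
    """Sort (doc_id, doc_args) pairs by descending score; returns (score, doc_id) pairs."""
    scored = [(arity_score(query_args, args), doc_id) for doc_id, args in candidates]
    return sorted(scored, reverse=True)

query_args = ("investors", "agreement")
candidates = [
    ("wsj-lead-story", ("investors", "agreement", "lawmakers")),  # one extra argument
    ("hypothetical-doc", ("investors", "agreement")),             # same argument count
]
for score, doc_id in rank(query_args, candidates):
    print(doc_id, round(score, 2))
```

Under this assumed formula the three-argument document scores 2/3 rather than 1, so it ranks below a document with matching arity but still comes first when it is the only candidate.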

Having described several embodiments of the concept- 
based indexing and search system in accordance with the 
present invention, it is believed that other modifications,
variations and changes will be suggested to those skilled in 
the art in view of the description set forth above. It is 
therefore to be understood that all such variations, modifi- 
cations and changes are believed to fall within the scope of 
the invention as defined in the appended claims.

What is claimed is: 

1. A method of performing concept-based searching of 
text documents comprising the steps of: 

transforming said text documents into predicate structures 
to form predicate libraries of said documents;

inputting a natural language query; 

creating a query predicate structure representing logical 
relationships between words in said natural language 
query, said predicate structure containing a predicate 
and an argument;

matching said query predicate structure to said document 
predicate structures in said predicate libraries; and 

presenting said matched predicate structures from said 
text documents. 

2. A method of performing concept-based searching of 
text documents as recited in claim 1, wherein said predicate 
is one of a verb and a preposition. 

3. A method of performing concept-based searching of 
text documents as recited in claim 1, wherein said argument 
is any part of speech. 

4. A method of performing concept-based searching of 
text documents as recited in claim 1, wherein said argument 
is a noun. 

5. A method of performing a concept-based searching of 
text documents comprising the steps of: 

transforming a natural language query into predicate 
structures representing logical relationships between 
words in said natural language query; 

providing an ontology containing lexical semantic
information about words;

transforming said text documents into predicate struc- 
tures; 

probabilistically classifying said document predicate 
structures and said query predicate structures;

filtering said document predicate structures against said 
query predicate structures to produce a set of said 
document predicate structures matching said query
predicate structures; and

ranking said set of matching predicate structures.

6. A method of performing concept-based searching of
text documents as recited in claim 5, further comprising the 
step of storing said ontology, said probabilistic classifica- 
tions and said predicate structures in a data repository. 

7. A method of performing concept-based searching of
text documents as recited in claim 5, wherein words and 
associated probabilities, comprising a statistically-derived 
category, are used to determine if a particular document 
belongs to a specific domain. 

8. A method of performing concept-based searching of
text documents as recited in claim 7, further comprising the 
step of collecting all attributes occurring in said document 
and determining if said document belongs to said specified 
domain. 

9. A method of performing concept-based searching of
text documents as recited in claim 6, further comprising the 
steps of: 

determining a topic of said query predicate structure; 

providing a set of trained document examples from said 
data repository; 

classifying said topic based on said trained set of docu- 
ment examples; and 

providing a list of possible topics ranked in order of 
probability of correctness. 
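
As an illustration of classifying a query topic from trained examples and returning topics ranked by probability, here is a minimal naive Bayes sketch in Python. It uses add-one smoothing over word counts; the function names, the whitespace tokenizer, and the toy training examples are assumptions and are not taken from the claimed system.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (topic, text). Returns (priors, per-topic word counts, vocabulary)."""
    topic_docs = Counter(topic for topic, _ in examples)
    word_counts = {topic: Counter() for topic in topic_docs}
    for topic, text in examples:
        word_counts[topic].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {t: n / len(examples) for t, n in topic_docs.items()}
    return priors, word_counts, vocab

def rank_topics(query, model):
    """Return topics sorted from most to least probable for the query."""
    priors, word_counts, vocab = model
    # Words never seen in training are ignored (an assumption of this sketch).
    words = [w for w in query.lower().split() if w in vocab]
    scores = {}
    for topic, prior in priors.items():
        total = sum(word_counts[topic].values())
        log_p = math.log(prior)
        for w in words:
            # Add-one smoothing: (count + 1) / (total words in topic + vocabulary size)
            log_p += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        scores[topic] = log_p
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical training examples.
examples = [
    ("finance", "stock market rally bond"),
    ("finance", "interest rate stock"),
    ("sports", "team wins match"),
    ("sports", "player scores goal"),
]
model = train(examples)
print(rank_topics("stock market friday", model))  # ['finance', 'sports']
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids underflow on longer queries.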

10. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein upon failure to 
match said document predicate structures to said query 
predicate structures, comparing documents added to said 
data repository or newly located ones of said documents to 
said query predicate structure, and notifying a user in the 
event of a match. 
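
The behavior recited above, keeping an unmatched query predicate and checking it against later documents, could look roughly like the following Python sketch. The `PersistentAgent` class, its method names, and the use of plain equality to decide a match are hypothetical; the claim does not specify them.

```python
class PersistentAgent:
    """Keeps unmatched query predicates and checks them against new documents."""

    def __init__(self):
        # (user, predicate) pairs still waiting for a match
        self.pending = []

    def register(self, user, predicate):
        """Store a query predicate that found no match at query time."""
        self.pending.append((user, predicate))

    def on_new_document(self, doc_id, doc_predicates):
        """Return (user, doc_id) notifications and drop the queries that matched."""
        doc_set = set(doc_predicates)
        hits = [(user, doc_id) for user, pred in self.pending if pred in doc_set]
        self.pending = [(u, p) for u, p in self.pending if p not in doc_set]
        return hits

agent = PersistentAgent()
agent.register("alice", ("in", ("stock market", "Friday")))
print(agent.on_new_document("doc-1", [("cheer", ("investors", "agreement"))]))  # []
print(agent.on_new_document("doc-2", [("in", ("stock market", "Friday"))]))     # [('alice', 'doc-2')]
```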

11. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein upon failure to 
match said document predicate structures to said query 
predicate structures, determining whether said query is for- 
mulated in terms not previously included in said ontology, 
and if said determination is positive, designating said query 
terms as new concepts and adding said query terms to said 
ontology. 

12. A method of performing concept-based searching of 
text documents as recited in claim 5, further comprising the 
step of clustering results of said search, said clustering step 
comprising the following steps of: 

forming a concept pattern vector from said document 
predicate structures; 

providing a feature map that self-adaptively clusters said 
concept pattern vectors according to said concept pat- 
terns in said documents; 

producing a cluster model representing documents, iden- 
tified in said concept-based searching, that reflects 
statistical distribution of said concept pattern vectors 
representing said documents; and 

providing at least one sample from said cluster model to 
focus search results. 

13. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein said step of 
transforming said text documents into predicate structures 
comprises the steps of: 

removing words that serve as placeholders in English- 
language; 

removing lexemes representing adjective concepts;

grouping proper nouns into single lexical nouns;

removing modal verbs; and

removing lexemes containing adverb concepts. 
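
The removal and grouping steps recited above can be sketched as a small filter over part-of-speech-tagged tokens. The tag names, the stop-word list, and the choice to group proper nouns before removing other tokens are assumptions made for this illustration only.

```python
# Hypothetical tag names and word lists, chosen for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "that"}
REMOVED_TAGS = {"ADJ", "MODAL", "ADV"}

def group_proper_nouns(tagged):
    """Merge runs of adjacent proper nouns into a single lexical noun."""
    grouped = []
    for word, tag in tagged:
        if tag == "PROPN" and grouped and grouped[-1][1] == "PROPN":
            grouped[-1] = (grouped[-1][0] + " " + word, "PROPN")
        else:
            grouped.append((word, tag))
    return grouped

def apply_filters(tagged):
    """Group proper nouns, then drop stop words, adjectives, modal verbs, and adverbs."""
    kept = []
    for word, tag in group_proper_nouns(tagged):
        if word.lower() in STOP_WORDS or tag in REMOVED_TAGS:
            continue
        kept.append((word, tag))
    return kept

sentence = [
    ("The", "DET"), ("Federal", "PROPN"), ("Reserve", "PROPN"),
    ("may", "MODAL"), ("quickly", "ADV"), ("raise", "VERB"),
    ("key", "ADJ"), ("rates", "NOUN"),
]
print(apply_filters(sentence))
# [('Federal Reserve', 'PROPN'), ('raise', 'VERB'), ('rates', 'NOUN')]
```

Grouping runs first so that removing a token cannot make two unrelated proper nouns look adjacent.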

14. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein said step of 
transforming a natural language query into predicate
structures comprises the steps of:

removing words that serve as placeholders in English- 
language; 

removing lexemes representing adjective concepts;

grouping proper nouns into single lexical nouns;

removing modal verbs;

removing lexemes containing adverb concepts; and

removing modal verbs from said query.

15. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein said step of 
transforming said natural language query comprises the
steps of: 

transforming said natural language query into multiple 
sequences of part-of-speech-tagged ontological con- 
cepts from said ontology; 

reducing the number of said multiple sequences based on
rules relating to sequences of syntactic tags; 

creating syntactic tree structures, based on said syntactic 
tags, representing grammatical relations between said 
ontological concepts; and 

reducing the number of said tree structures based on rules
relating to improbable syntactic structures, and rules 
concerning conflicting ontological specifications. 

16. A method of performing concept-based searching of 
text documents as recited in claim 15, further comprising the 
step of converting said tree structures into predicate
structures.

17. A method of performing concept-based searching of 
text documents as recited in claim 5, wherein said step of 
transforming said text documents comprises the steps of:

transforming said documents into multiple sequences of
part-of-speech-tagged ontological concepts from said 
ontology; 

reducing the number of said multiple sequences based on 
rules relating to sequences of syntactic tags; 

creating syntactic tree structures representing grammatical
relations between said ontological concepts based
on said syntactic tags; and

reducing the number of said tree structures based on rules 
relating to improbable syntactic structures, and rules 
concerning conflicting ontological specifications.

18. A method of performing concept-based searching of 
text documents as recited in claim 17, further
comprising the step of converting said tree structures into 
predicate structures. 

19. A method of performing concept-based searching of 
text documents as recited in claim 12, further comprising the
step of using said ontology to develop said feature map to 
cluster said concept patterns. 

20. An apparatus for use in an information retrieval 
system for retrieving information in response to a query, 
comprising:

a query ontological parser that transforms a natural lan- 
guage query into predicate structures; 

an ontology providing information about words, said 
information comprising lexical semantic representation 
and syntactic types;

a document ontological parser that transforms documents 
into predicate structures; 

a Bayes classifier probabilistically classifying said docu- 
ments and said query; 

adaptive filters for filtering said documents against said
query to produce a set of said documents matching said 
query; and 



a ranking module for ranking said set of matching docu- 
ments. 

21. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, further comprising a data repository 
storing said ontology, results from said Bayes classifier, and 
said predicate structures from said document ontological 
structure. 

22. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said query ontological parser 
comprises: 

a sentence lexer that transforms said natural language 
query into multiple sequences of part-of-speech-tagged 
ontological concepts from said ontology; 

post-lexer filters that reduce the number of said multiple 
sequences produced by said sentence lexer, based on 
rules relating to sequences of syntactic tags; 

a parser that creates syntactic tree structures representing 
grammatical relations between said ontological con- 
cepts based on said syntactic tags; and 

post-parser filters that reduce the number of said parse 
trees based on rules relating to improbable syntactic 
structures, and rules concerning conflicting ontological 
specifications. 

23. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 21, wherein said Bayes classifier comprises 
a learner that produces a set of trained document examples 
from data obtained from said data repository. 

24. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 21, wherein said Bayes classifier comprises 
a reasoner that determines a probability that a classified 
document matches said query.

25. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said Bayes classifier comprises 
a reasoner that determines a probability that a classified 
document matches said query. 

26. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 24, wherein said Bayes classifier is 
document-domain-specific so that words representative of a 
concept are used to determine if a particular document 
belongs to a specific domain, and said reasoner determines 
a probability that a pre-classified document belongs to said 
specific domain that said Bayes classifier is trained to 
classify. 

27. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 26, further comprising an attribute extractor 
that collects all attributes occurring in said documents and 
sends said attributes to said reasoner to determine if said 
documents belong to said specified domain. 

28. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 24, wherein said Bayes classifier is query- 
topic specific so that words that form said query are used to 
determine a topic of said query. 

29. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 28, wherein said Bayes classifier further 
comprises a learner that produces a set of trained document 
examples from data obtained from said data repository. 

30. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 29, wherein said reasoner classifies said 
topic based on said trained set of document examples and 
provides a list of possible topics ranked in order of prob- 
ability of correctness. 




31. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said document ontological 
parser comprises: 

a sentence lexer that transforms said documents into 
multiple sequences of part-of-speech-tagged ontologi- 
cal concepts from said ontology; 

post-lexer filters that reduce the number of said multiple 
sequences produced by said sentence lexer, based on 
rules relating to sequences of syntactic tags; 

a parser that creates syntactic tree structures representing 
grammatical relations between said ontological con- 
cepts based on said syntactic tags; and 

post-parser filters that reduce the number of said parse
trees based on rules relating to improbable syntactic 
structures, and rules concerning conflicting ontological 
specifications. 

32. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, further comprising a persistent agent 
maintaining at least one of said predicate structures 
extracted from said query, 

wherein, upon failure to match said documents to said 
query, documents added to said data repository or 
newly located ones of said documents parsed by said 
document ontological parser are compared to said at 
least one predicate structure extracted from said query, 
and a notification is sent to a user upon a match. 

33. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, further comprising a persistent agent 
maintaining at least one of said predicate structures 
extracted from said query, 

wherein, upon failure to match said documents to said 
query, a determination is made whether said query is 
formulated in terms not previously included in said 
ontology, and if said determination is positive, said 
query terms are designated as new concepts and added 
to said ontology. 

34. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said ranking module determines 
similarity between said query and each of said documents 
returned from said data repository. 

35. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said ranking module determines 
similarity between said predicate structure of said query and 
each predicate structure of said documents returned from 
said data repository. 

36. An apparatus for use in an information retrieval 
system for retrieving information in response to a query 
comprising: 

a query ontological parser that transforms a natural lan- 
guage query into predicate structures; 

an ontology providing information about words, said
information comprising syntactic uses and definitions; 

a document ontological parser that transforms documents 
into predicate structures; 

a Bayes classifier probabilistically classifying said
documents and said query;

adaptive filters for filtering said predicate structures of 
said documents against said predicate structures of said 
query to group said documents according to similarity 
of concept patterns contained in said documents
relative to said query or additional feedback; and

a ranking module for ranking said set of matching
predicate structures.

37. An apparatus for use in an information retrieval
system for retrieving information in response to a query as 
recited in claim 20, wherein said predicate structures for 
each of said documents forms at least one concept pattern
vector for each of said documents.

38. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 36, wherein said predicate structures for 
each of said documents forms at least one concept pattern
vector for each of said documents.

39. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 20, wherein said query predicate structure 
and said document predicate structures comprise a predicate
and an argument, said predicate is one of a verb and a
preposition, and said argument is any part of speech. 

40. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 36, wherein said query predicate structure
and said document predicate structures comprise a predicate
and an argument, said predicate is one of a verb and a 
preposition, and said argument is any part of speech. 

41. An apparatus for use in an information retrieval
system for retrieving information in response to a query as
recited in claim 20, wherein said adaptive filters comprise a
feature map that clusters said matching documents according
to concept patterns in said query and produces a cluster
model representing a statistical probability distribution of
said matching documents.

42. An apparatus for use in an information retrieval
system for retrieving information in response to a query as
recited in claim 22, wherein said post-lexer filters comprise:

a stop word filter that removes words that serve as
placeholders in English-language;

an adjective filter that removes lexemes representing
adjective concepts;

a proper noun filter that groups proper nouns into single
lexical nouns;

a modal verb filter that removes modal verbs;

an adverb filter that removes lexemes containing adverb
concepts; and

a pseudo-predicate filter that removes verbs from said
queries.

43. An apparatus for use in an information retrieval
system for retrieving information in response to a query as
recited in claim 31, wherein said post-lexer filters comprise:

a stop word filter that removes words that serve as
placeholders in English-language;

an adjective filter that removes lexemes representing
adjective concepts;

a proper noun filter that groups proper nouns into single
lexical nouns;

a modal verb filter that removes modal verbs; and

an adverb filter that removes lexemes containing adverb
concepts.

44. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 22, wherein said parser comprises a parse 
tree converter for converting parse trees into predicate 
structures. 

45. An apparatus for use in an information retrieval 
system for retrieving information in response to a query as 
recited in claim 31, wherein said parser comprises a parse 
tree converter for converting parse trees into predicate 
structures. 

***** 